[00:12:11] (CR) Gergő Tisza: [C: +2] Enable session-ip log channel on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633276 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[00:12:58] (Merged) jenkins-bot: Enable session-ip log channel on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633276 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[00:13:53] !log built prometheus-nutcracker-exporter for buster and imported on apt1001 (0.2+nmu1)
[00:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:28] Operations, serviceops, Patch-For-Review: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Dzahn) >>! In T264991#6533968, @Dzahn wrote: > - prometheus-nutcracker-exporter I rebuilt prometheus-nutcracker-exporter for buster on deneb as version 0.2+n...
[00:18:51] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633276|Enable session-ip log channel on eswiki (T264799)]] (duration: 00m 55s)
[00:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:56] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799
[00:44:36] (PS1) Gergő Tisza: Enable session-ip log channel on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633277 (https://phabricator.wikimedia.org/T264799)
[00:50:03] (CR) Gergő Tisza: [C: +2] Enable session-ip log channel on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633277 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[00:50:50] (Merged) jenkins-bot: Enable session-ip log channel on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633277 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[00:54:49] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633277|Enable session-ip log channel on all but enwiki (T264799)]] (duration: 01m 01s)
[00:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:56] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799
[01:15:24] (PS1) Gergő Tisza: Enable session-ip log channel everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/633281 (https://phabricator.wikimedia.org/T264799)
[01:21:37] (PS1) Dzahn: wikistats: add a recursive diff to the output of the deploy script [puppet] - https://gerrit.wikimedia.org/r/633282
[01:24:54] (CR) Gergő Tisza: [C: +2] Enable session-ip log channel everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/633281 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[01:25:36] (Merged) jenkins-bot: Enable session-ip log channel everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/633281 (https://phabricator.wikimedia.org/T264799) (owner: Gergő Tisza)
[01:32:07] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633281|Enable session-ip log channel everywhere (T264799)]] (duration: 00m 59s)
[01:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:14] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799
[02:37:12] (PS2) Dzahn: wikistats: echo a recursive diff to the output of the deploy script [puppet] - https://gerrit.wikimedia.org/r/633282
[02:37:46] (CR) Dzahn: [C: +2] "cloud-only" [puppet] - https://gerrit.wikimedia.org/r/633282 (owner: Dzahn)
[02:58:12] (PS1) Dzahn: wikistats: remove php7.0 pre-buster support [puppet] - https://gerrit.wikimedia.org/r/633286
[02:58:31] (CR) jerkins-bot: [V: -1] wikistats: remove php7.0 pre-buster support [puppet] - https://gerrit.wikimedia.org/r/633286 (owner: Dzahn)
[03:05:21] (PS2) Dzahn: wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - https://gerrit.wikimedia.org/r/633286
[03:05:40] (CR) jerkins-bot: [V: -1] wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - https://gerrit.wikimedia.org/r/633286 (owner: Dzahn)
[03:06:22] (PS3) Dzahn: wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - https://gerrit.wikimedia.org/r/633286
[03:08:17] (PS4) Dzahn: wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - https://gerrit.wikimedia.org/r/633286
[03:39:37] (PS1) Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - https://gerrit.wikimedia.org/r/633288
[03:42:09] (PS2) Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - https://gerrit.wikimedia.org/r/633288
[03:43:12] (CR) jerkins-bot: [V: -1] wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - https://gerrit.wikimedia.org/r/633288 (owner: Dzahn)
[04:32:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 192 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:38:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 40 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:06:17] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 144 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:09:39] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:59:45] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201010T0700)
[07:01:21] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:31:22] Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Aklapper)
[08:36:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:38:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:56:20] (PS1) Elukey: Remove analytics1045 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633296 (https://phabricator.wikimedia.org/T255140)
[11:56:54] (CR) Elukey: [C: +2] Remove analytics1045 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633296 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[12:33:43] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[12:34:59] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[14:49:40] Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (bd808) Should this task be merged with {T245757} somehow?
[15:54:04] (PS1) Andrew Bogott: backy2: throttle bandwidth for reading and writing [puppet] - https://gerrit.wikimedia.org/r/633306 (https://phabricator.wikimedia.org/T260692)
[15:54:47] (CR) Andrew Bogott: [C: +2] backy2: throttle bandwidth for reading and writing [puppet] - https://gerrit.wikimedia.org/r/633306 (https://phabricator.wikimedia.org/T260692) (owner: Andrew Bogott)
[16:00:11] PROBLEM - SSH on analytics1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:12:35] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:13:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:29] RECOVERY - SSH on analytics1046.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:19] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[18:09:51] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:04:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:06:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:56:45] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:57:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down