[00:02:35] (PS1) Legoktm: extdist: Drop pre-stretch support [puppet] - https://gerrit.wikimedia.org/r/560957
[02:51:31] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:31] PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[03:13:03] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:27] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[04:16:46] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Arjunaraoc) @Mathew.onipe @Gehel Thanks for completing this task, I am able to get my recent OSM edits reflect on w...
[10:06:43] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3061.esams.wmnet
[10:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:46] !log powercycle cp3061 - mgmt serial console not showing a working tty, no ssh
[10:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:45] RECOVERY - Host cp3061 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[10:57:03] !log repool cp3061 T238305
[10:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:10] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305
[11:13:56] !log restarted wikibugs to fix phab irc notifications
[11:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:49] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema)
[11:50:29] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 31882376 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:52:15] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 55880 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:45:53] (PS1) Ladsgroup: Offboard Tim [puppet] - https://gerrit.wikimedia.org/r/560972
[12:47:47] (CR) jerkins-bot: [V: -1] Offboard Tim [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[12:59:07] (CR) Peachey88: "Can I suggest a slightly more descriptive commit message, example might be something like "Off-boarding Tim Eulitz"" [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[13:11:09] PROBLEM - MD RAID on ms-be2035 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:11:10] ACKNOWLEDGEMENT - MD RAID on ms-be2035 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T241534 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:11:14] Operations, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (ops-monitoring-bot)
[13:22:24] (CR) MarcoAurelio: [C: -1] "See inline comments. I also agree with Peachey88. There's also an apparently easy way to offboard users documented at https://w.wiki/Ech v" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[13:30:55] PROBLEM - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:30:57] ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T241535 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:31:01] Operations, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241535 (ops-monitoring-bot)
[13:34:41] (PS3) Gehel: wdqs: use RecentChanges API for updates on all WDQS servers [puppet] - https://gerrit.wikimedia.org/r/560922 (https://phabricator.wikimedia.org/T241410)
[13:37:13] (CR) Gehel: [C: +2] wdqs: use RecentChanges API for updates on all WDQS servers [puppet] - https://gerrit.wikimedia.org/r/560922 (https://phabricator.wikimedia.org/T241410) (owner: Gehel)
[13:37:33] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241535 (Peachey88)
[14:04:20] PROBLEM - Device not healthy -SMART- on ms-be2035 is CRITICAL: cluster=swift device={cciss,0,cciss,1,cciss,10,cciss,11,cciss,12,cciss,13,cciss,2,cciss,3,cciss,4,cciss,5,cciss,6,cciss,7,cciss,8,cciss,9} instance=ms-be2035:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops
[14:31:25] (PS2) Ladsgroup: Offboard Tim Eulitz [puppet] - https://gerrit.wikimedia.org/r/560972
[14:32:33] (CR) Ladsgroup: "> Patch Set 1: Code-Review-1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[15:18:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:24:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[17:47:59] Hey all - going to deploy a quick security fix for T241410 (https://gerrit.wikimedia.org/r/560978) in a few minutes.
[17:58:18] !log sbassett@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/EventBus/includes/EventFactory.php: Security fix for T241410 (duration: 00m 56s)
[17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:49] (CR) MarcoAurelio: [C: +1] "Looks good to me now. Waiting on ldap/ops people to review and merge." [puppet] - https://gerrit.wikimedia.org/r/560972 (owner: Ladsgroup)
[18:48:47] Hi, can someone with +2 in integration/config merge https://gerrit.wikimedia.org/r/#/c/integration/config/+/560982/?
[18:49:39] Jayprakash12345: a cheque would help ;-)
[18:50:20] I'm not sure there's anyone right now though
[18:52:41] hauskatze: Ok, thank you! Maybe someone will merge it when they're online :)
[18:55:25] hashar was online a while ago I think
[18:55:33] you may be lucky :)
[19:03:26] (PS1) Gehel: wdqs: enable async_import on eqiad public cluster [puppet] - https://gerrit.wikimedia.org/r/560987 (https://phabricator.wikimedia.org/T241410)
[19:03:46] (PS2) Gehel: wdqs: enable async_import on eqiad public cluster [puppet] - https://gerrit.wikimedia.org/r/560987 (https://phabricator.wikimedia.org/T241410)
[19:05:44] (CR) Gehel: [C: +2] "PCC looks happy" [puppet] - https://gerrit.wikimedia.org/r/560987 (https://phabricator.wikimedia.org/T241410) (owner: Gehel)
[20:51:28] PROBLEM - traffic_server tls process restarted on cp1083 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1083&var-layer=tls
[20:53:44] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.154e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:29:35] !log repooling wdqs1007
[22:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log