[00:00:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [00:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:37] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [00:16:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [00:16:05] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `testvm4001.ulsfo.wmnet` - testvm4001.ulsfo.wmnet (**WARN**) - **Failed do... [00:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:05] (03PS4) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [00:18:19] (03CR) 10Dzahn: "ah yes, rebase needed because toolforge/services/basic.pp was deleted earlier today. should be fixed now" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [00:20:42] (03PS12) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 [00:20:45] (03CR) 10Dzahn: labstore: add data types and some other style fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [00:22:39] (03CR) 10Bstorm: "It seems...upset https://puppet-compiler.wmflabs.org/compiler1003/25802/" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [00:28:45] (03PS13) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 [00:29:30] (03CR) 10Dzahn: "also have to use " Hash[String, Hash[String, Variant[Integer,String]]] $drbd_resource_config " now because of the port as actual number" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [00:31:48] (03CR) 10Dzahn: [C: 04-1] "arr.. 
still not there :( https://puppet-compiler.wmflabs.org/compiler1003/25804/labstore1004.eqiad.wmnet/change.labstore1004.eqiad.wmnet.e" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [00:35:15] (03PS5) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [00:36:32] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [00:39:38] (03PS6) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [00:42:28] (03PS7) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [00:45:27] (03CR) 10Dzahn: [V: 03+1] "after fixing multiple issues, finally looking good: https://puppet-compiler.wmflabs.org/compiler1002/25807/" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [02:38:10] (03PS1) 10Andrew Bogott: wmcs-backup-instances: add missing argument [puppet] - 10https://gerrit.wikimedia.org/r/633049 (https://phabricator.wikimedia.org/T260692) [02:39:20] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup-instances: add missing argument [puppet] - 10https://gerrit.wikimedia.org/r/633049 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [05:18:09] (03PS1) 10Marostegui: control-mariadb*: Bump version [software] - 10https://gerrit.wikimedia.org/r/633053 [05:18:48] (03CR) 10Marostegui: [C: 03+2] control-mariadb*: Bump version [software] - 10https://gerrit.wikimedia.org/r/633053 (owner: 10Marostegui) [06:07:20] PROBLEM - Host lvs3005 is DOWN: PING CRITICAL - Packet loss = 100% [06:07:36] PROBLEM - Host ncredir-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [06:08:00] checking [06:08:02] RECOVERY - Host ncredir-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 92.63 ms [06:08:06] RECOVERY - Host lvs3005 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms [06:08:31] the host didn't reboot 
[06:08:42] Uh
[06:08:52] I guess network glitch?
[06:09:15] kern.log reports changes on network link status?
[06:09:30] nope
[06:14:27] don't see anything wrong on the network either
[06:49:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:50:29] (PS1) Marostegui: es*: Unify new es hosts config [puppet] - https://gerrit.wikimedia.org/r/633133 (https://phabricator.wikimedia.org/T261717)
[06:56:44] (CR) Marostegui: [C: +2] es*: Unify new es hosts config [puppet] - https://gerrit.wikimedia.org/r/633133 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201009T0700)
[07:01:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:01:12] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:01:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6461/IPv4: Idle - Zayo, AS6461/IPv6: Idle - Zayo https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:02:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:12] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:47] hm: elasticsearch_6@production-search-eqiad.service: Main process exited, code=killed, status=11/SEGV ^ :/
[07:09:43] ouch
[07:09:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 53, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:10:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:55] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Joe) I think some form of ratelimiting for that should be present in restbas...
[07:12:51] elukey: seeing "Killing elasticsearch[e:3361 due to hardware memory corruption fault at 7ff100d68000" should we depool this machine?
[07:13:27] checking
[07:13:30] (PS1) Elukey: admin: add user lexnasser back to active state [puppet] - https://gerrit.wikimedia.org/r/633135 (https://phabricator.wikimedia.org/T265071)
[07:14:56] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:06] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:13] dcausse: I checked the DELL's DRAC and I don't see any memory errors reported (like DIMM bank on fire etc.. :D)
[07:16:52] so we have two options
[07:17:09] 1) we leave the host running, and if it fails again we completely depool it (relying on {1}[Hardware Error]: It has been corrected by h/w and requires no further action)
[07:17:21] 2) we depool it directly asking for a memory test from dcops
[07:17:39] likely 2 is better anyway, but for the time being we could see if the host keeps running
[07:17:55] elukey: ok I'm filing a task
[07:18:00] ack
[07:21:06] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/633135 (https://phabricator.wikimedia.org/T265071) (owner: Elukey)
[07:21:44] (CR) Elukey: "If restoring the ssh key is ok I'll also take care of the other perms (ldap/kerberos).
Lex already started working so he'd need access to " [puppet] - 10https://gerrit.wikimedia.org/r/633135 (https://phabricator.wikimedia.org/T265071) (owner: 10Elukey) [07:24:06] 10Operations, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10dcausse) [07:25:33] (03CR) 10Elukey: [C: 03+2] admin: add user lexnasser back to active state [puppet] - 10https://gerrit.wikimedia.org/r/633135 (https://phabricator.wikimedia.org/T265071) (owner: 10Elukey) [07:26:30] 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10elukey) [07:26:35] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10MoritzMuehlenhoff) The "leila" account also needs to be removed from the wmf LDAP group. [07:31:54] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10MoritzMuehlenhoff) >>! In T264888#6529767, @jbond wrote: >however i, like @BBlack, prefer reject to drop if possible. As such it would be nice to be good... [07:32:56] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10elukey) @lexnasser you should now be able to ssh to the stat100x hosts (notebooks are not there anymore, deprecated, we copied your things... 
[07:34:01] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [07:34:01] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:34] !log installing xen security updates for buster (libs only) [07:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:38] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Marostegui) [07:40:49] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10Kormat) >>! In T264472#6531651, @MoritzMuehlenhoff wrote: > The "leila" account also needs to be removed from the wmf LDAP group. Good catch, done. [07:41:02] (03CR) 10Elukey: [C: 03+1] "Looks good, before merging it is better to try the hdfs command to verify that it works (it should but it changed a little bit, so better " [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [07:45:49] (03PS1) 10Muehlenhoff: Update cloudvirt Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/633137 [07:46:59] (03CR) 10Muehlenhoff: [C: 03+2] Update cloudvirt Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/633137 (owner: 10Muehlenhoff) [07:56:51] (03PS1) 10Elukey: Decommission analytics1044 from Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/633140 (https://phabricator.wikimedia.org/T255140) [07:58:20] (03CR) 10Elukey: [C: 03+2] Decommission analytics1044 from Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/633140 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [08:11:05] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [08:11:06] !log filippo@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) [08:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:00] (03PS1) 10Marostegui: dbstore1005: Decrease buffer pool size [puppet] - 10https://gerrit.wikimedia.org/r/633142 [08:17:53] (03CR) 10Elukey: [C: 03+1] dbstore1005: Decrease buffer pool size [puppet] - 10https://gerrit.wikimedia.org/r/633142 (owner: 10Marostegui) [08:19:17] (03PS4) 10ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319) [08:20:01] (03CR) 10Marostegui: [C: 03+2] dbstore1005: Decrease buffer pool size [puppet] - 10https://gerrit.wikimedia.org/r/633142 (owner: 10Marostegui) [08:22:32] !log Restart dbstore1005 mysql to pick up new buffer pool sizes [08:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:09] (03PS1) 10Muehlenhoff: Remove access for rush [puppet] - 10https://gerrit.wikimedia.org/r/633144 [08:30:56] (03CR) 10jerkins-bot: [V: 04-1] Remove access for rush [puppet] - 10https://gerrit.wikimedia.org/r/633144 (owner: 10Muehlenhoff) [08:33:36] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I would like to know what the Traffic team is doing or planning on doing about this investigation at the moment. Now that I've narrow... 
[08:34:04] (03PS2) 10Muehlenhoff: Remove access for rush [puppet] - 10https://gerrit.wikimedia.org/r/633144 [08:43:01] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh CIDR for vlan 2107 - cloud-gw-transport-codfw [puppet] - 10https://gerrit.wikimedia.org/r/633147 (https://phabricator.wikimedia.org/T263622) [08:43:01] 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10LSobanski) Removing #DBA as there's nothing specific for us to do right now, do add us back if anything comes up. [08:45:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh CIDR for vlan 2107 - cloud-gw-transport-codfw [puppet] - 10https://gerrit.wikimedia.org/r/633147 (https://phabricator.wikimedia.org/T263622) (owner: 10Arturo Borrero Gonzalez) [08:46:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for rush [puppet] - 10https://gerrit.wikimedia.org/r/633144 (owner: 10Muehlenhoff) [08:49:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:52:04] (03PS1) 10Elukey: cdh: increase retention for the RFA log appender [puppet] - 10https://gerrit.wikimedia.org/r/633148 [08:52:33] (03CR) 10Elukey: [C: 03+2] cdh: increase retention for the RFA log appender [puppet] - 10https://gerrit.wikimedia.org/r/633148 (owner: 10Elukey) [08:57:23] (03PS1) 10Ayounsi: Remove user rush [homer/public] - 10https://gerrit.wikimedia.org/r/633149 [08:58:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/633149 (owner: 10Ayounsi) [08:59:53] (03CR) 10Ayounsi: [C: 03+2] Remove user rush [homer/public] - 10https://gerrit.wikimedia.org/r/633149 (owner: 10Ayounsi) [09:00:20] (03Merged) 10jenkins-bot: Remove user rush [homer/public] - 10https://gerrit.wikimedia.org/r/633149 (owner: 10Ayounsi) [09:02:02] (03PS1) 10Muehlenhoff: Remove Chase from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/633150 [09:06:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove Chase from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/633150 (owner: 10Muehlenhoff) [09:06:56] (03PS1) 10Elukey: Allow the hdfs user to run Yarn jobs in Hadoop clusters [puppet] - 10https://gerrit.wikimedia.org/r/633151 [09:07:03] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Gilles) @bblack when we last discussed the subject of this task in a meeting recently, you mentioned that replacing ats-tls (the "p... 
[09:07:18] !log remove user from all network devices [09:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:32] (03PS2) 10JMeybohm: eventgate-analytics: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632712 (https://phabricator.wikimedia.org/T264157) [09:08:10] (03CR) 10Joal: "LGTM - Thanks elukey :)" [puppet] - 10https://gerrit.wikimedia.org/r/633151 (owner: 10Elukey) [09:08:23] (03CR) 10Elukey: [C: 03+2] Allow the hdfs user to run Yarn jobs in Hadoop clusters [puppet] - 10https://gerrit.wikimedia.org/r/633151 (owner: 10Elukey) [09:11:09] (03PS1) 10Elukey: Remove incorrect banned user (hdfs) from Hadoop Yarn container's settings [puppet] - 10https://gerrit.wikimedia.org/r/633154 [09:11:31] (03CR) 10Elukey: [C: 03+2] Remove incorrect banned user (hdfs) from Hadoop Yarn container's settings [puppet] - 10https://gerrit.wikimedia.org/r/633154 (owner: 10Elukey) [09:16:16] (03CR) 10Jbond: [C: 03+1] toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [09:17:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh external connection for neutron [puppet] - 10https://gerrit.wikimedia.org/r/633155 (https://phabricator.wikimedia.org/T261724) [09:18:05] (03PS10) 10Gehel: Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) [09:18:12] (03CR) 10Gehel: Introduce an interface for progress bars. 
(033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [09:18:35] (03Abandoned) 10Gehel: extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [09:18:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: refresh external connection for neutron [puppet] - 10https://gerrit.wikimedia.org/r/633155 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [09:19:41] (03CR) 10jerkins-bot: [V: 04-1] Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [09:24:52] PROBLEM - Check systemd state on an-worker1079 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:06] this is me --^ [09:25:24] 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10Kormat) It is, yep, thanks. I just need the staff contact + contract end date now. [09:25:33] PROBLEM==elukey. it is known. 
[09:25:38] yeah correct [09:26:08] RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:22] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [09:27:01] (03PS1) 10Jbond: Firewall: Change the default firewall rule fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) [09:27:03] (03PS1) 10Jbond: Firewall: Change the default firewall rule cloud environment [puppet] - 10https://gerrit.wikimedia.org/r/633157 (https://phabricator.wikimedia.org/T264888) [09:29:59] (03CR) 10JMeybohm: [C: 03+2] eventgate-analytics: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632712 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:30:09] (03PS2) 10JMeybohm: eventgate-main: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632713 (https://phabricator.wikimedia.org/T264157) [09:30:58] (03PS2) 10JMeybohm: mathoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) [09:31:10] (03PS2) 10JMeybohm: mobileapps: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632715 (https://phabricator.wikimedia.org/T264157) [09:31:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:17] (03PS2) 10JMeybohm: proton: Update 
envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) [09:31:43] (03PS2) 10JMeybohm: push-notifications: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632717 (https://phabricator.wikimedia.org/T264157) [09:31:52] (03PS2) 10JMeybohm: termbox: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632718 (https://phabricator.wikimedia.org/T264157) [09:31:57] (03PS2) 10JMeybohm: wikifeeds: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632719 (https://phabricator.wikimedia.org/T264157) [09:32:03] (03PS2) 10JMeybohm: zotero: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632720 (https://phabricator.wikimedia.org/T264157) [09:32:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:32:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:32:54] (03Merged) 10jenkins-bot: eventgate-analytics: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632712 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:33:31] (03CR) 10jerkins-bot: [V: 04-1] proton: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:35:46] (03CR) 10JMeybohm: [C: 03+2] eventgate-main: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632713 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:35:48] (03CR) 10JMeybohm: [C: 03+2] mathoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:37:45] (03Merged) 10jenkins-bot: eventgate-main: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632713 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:37:53] (03CR) 10jerkins-bot: [V: 04-1] mathoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:38:15] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [09:38:15] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[09:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:33] (03CR) 10Kormat: [C: 03+1] pontoon: use hiera.output [puppet] - 10https://gerrit.wikimedia.org/r/632921 (owner: 10Filippo Giunchedi) [09:40:49] (03CR) 10Kormat: [C: 03+1] "Nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/632918 (owner: 10Filippo Giunchedi) [09:43:03] (03CR) 10Kormat: [C: 03+1] pontoon: read stack from stack.file [puppet] - 10https://gerrit.wikimedia.org/r/632919 (owner: 10Filippo Giunchedi) [09:43:58] (03CR) 10Kormat: [C: 03+1] pontoon: configure hiera based on the stack found on the filesystem [puppet] - 10https://gerrit.wikimedia.org/r/632920 (owner: 10Filippo Giunchedi) [09:45:58] (03PS1) 10Filippo Giunchedi: pontoon: set labs_tld and labs_site globally [puppet] - 10https://gerrit.wikimedia.org/r/633158 [09:47:31] !log roll restart of hadoop-yarn-nodemanager on all hadoop workers to pick up new settings [09:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:09] (03CR) 10Kormat: [C: 03+1] pontoon: set labs_tld and labs_site globally [puppet] - 10https://gerrit.wikimedia.org/r/633158 (owner: 10Filippo Giunchedi) [09:48:37] (03PS1) 10JMeybohm: admin: jayme dotfiles: Add helmfile aliases [puppet] - 10https://gerrit.wikimedia.org/r/633159 [09:49:46] (03CR) 10JMeybohm: [C: 03+2] admin: jayme dotfiles: Add helmfile aliases [puppet] - 10https://gerrit.wikimedia.org/r/633159 (owner: 10JMeybohm) [09:52:53] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set labs_tld and labs_site globally [puppet] - 10https://gerrit.wikimedia.org/r/633158 (owner: 10Filippo Giunchedi) [09:53:56] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[09:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:47] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [09:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use hiera.output [puppet] - 10https://gerrit.wikimedia.org/r/632921 (owner: 10Filippo Giunchedi) [09:55:59] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: write the stack name once to the filesystem [puppet] - 10https://gerrit.wikimedia.org/r/632918 (owner: 10Filippo Giunchedi) [09:56:01] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: read stack from stack.file [puppet] - 10https://gerrit.wikimedia.org/r/632919 (owner: 10Filippo Giunchedi) [09:56:03] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: configure hiera based on the stack found on the filesystem [puppet] - 10https://gerrit.wikimedia.org/r/632920 (owner: 10Filippo Giunchedi) [10:09:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "The change (as discussed in the phab task) looks good to me overall." [puppet] - 10https://gerrit.wikimedia.org/r/633157 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [10:11:49] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [10:11:49] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [10:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:12] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [10:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[10:17:18] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [10:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:37] (03PS2) 10Muehlenhoff: Install ldap-replica200[34] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388) [10:39:27] (03PS1) 10Elukey: Set up the new Analytics Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/633162 (https://phabricator.wikimedia.org/T255139) [10:41:10] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [10:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:16] dcausse: ^ I'll have a look when back from lunch [10:47:09] (03CR) 10Hnowlan: [C: 03+1] Use ubuntu 16.04 as buildsystem to be compatible with stretch [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/632479 (owner: 10JMeybohm) [10:49:03] (03CR) 10JMeybohm: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:49:56] (03CR) 10JMeybohm: [C: 03+2] Use ubuntu 16.04 as buildsystem to be compatible with stretch [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/632479 (owner: 10JMeybohm) [10:51:18] (03Merged) 10jenkins-bot: mathoid: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632714 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:52:02] (03CR) 10JMeybohm: [C: 03+2] mobileapps: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632715 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:52:19] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on 
namespace 'mathoid' for release 'staging' . [10:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:15] (03Merged) 10jenkins-bot: mobileapps: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632715 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:58:51] (03CR) 10Elukey: [C: 03+2] Set up the new Analytics Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/633162 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [11:02:22] (03PS1) 10Elukey: Remove min disk available constraint from Hadoop test workers' settings [puppet] - 10https://gerrit.wikimedia.org/r/633164 (https://phabricator.wikimedia.org/T255139) [11:04:24] (03CR) 10Elukey: [C: 03+2] Remove min disk available constraint from Hadoop test workers' settings [puppet] - 10https://gerrit.wikimedia.org/r/633164 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [11:08:38] (03CR) 10Hnowlan: [C: 03+2] map::postgresql_common: make maps-admin chgrp toggle [puppet] - 10https://gerrit.wikimedia.org/r/632935 (https://phabricator.wikimedia.org/T263726) (owner: 10Hnowlan) [11:13:41] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [11:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:47] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:14] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[11:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:43] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25808/" [puppet] - 10https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [11:24:07] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:26:07] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:30:21] (03PS5) 10ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319) [11:30:23] (03PS1) 10ArielGlenn: bump version to 0.0.10 [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/633166 [11:30:38] (03PS1) 10ArielGlenn: version 0.0.10 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/633167 [11:38:06] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[11:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:07] (03PS1) 10Elukey: Avoid the analytics keytab for the Analytics Hadoop test master [puppet] - 10https://gerrit.wikimedia.org/r/633168 (https://phabricator.wikimedia.org/T255139) [11:43:45] (03CR) 10Elukey: [C: 03+2] Avoid the analytics keytab for the Analytics Hadoop test master [puppet] - 10https://gerrit.wikimedia.org/r/633168 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [11:48:02] (03PS1) 10Elukey: Revert "Avoid the analytics keytab for the Analytics Hadoop test master" [puppet] - 10https://gerrit.wikimedia.org/r/633079 [11:49:17] (03CR) 10Elukey: [C: 03+2] Revert "Avoid the analytics keytab for the Analytics Hadoop test master" [puppet] - 10https://gerrit.wikimedia.org/r/633079 (owner: 10Elukey) [11:51:32] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) Also related https://tickets.puppetlabs.com/browse/PE-24280 [12:02:18] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:02:33] (03CR) 10JMeybohm: [C: 03+2] push-notifications: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632717 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:06:17] (03CR) 10JMeybohm: [C: 03+2] proton: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:08:45] (03Merged) 10jenkins-bot: proton: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632716 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:08:47] (03Merged) 
10jenkins-bot: push-notifications: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632717 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:13:40] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [12:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:23] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [12:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:23] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [12:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:24] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [12:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:42] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [12:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:52] (03PS11) 10Gehel: Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) [12:26:39] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6387745, @MMiller_WMF wrote: > @kostajh -- maybe we should do that, but I would like to hear from @ne... 
[12:27:21] (03PS1) 10Muehlenhoff: Add an apt proxy config for deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) [12:30:53] (03CR) 10Gehel: Introduce an interface for progress bars. (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [12:33:23] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [12:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:36] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) Need to check this further but it may be possible to switch to [[ https://tickets.puppetlabs.com/browse/PUP-9055 | puppet catalogue compile ]] [12:46:08] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [12:47:26] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [12:47:50] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [12:49:10] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [12:49:14] (03CR) 10JMeybohm: [C: 03+2] termbox: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632718 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:49:29] investigating what's up with the alerts above [12:50:19] hah, heavy query on prometheus codfw [12:51:22] I'm probably loading a bit of data while checking deployments. But nothing really out of normal I would guess [12:51:33] (03Merged) 10jenkins-bot: termbox: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632718 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:52:47] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [12:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [12:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] jayme: ack, thanks! yeah ATM not easy to know what could be causing it [12:57:42] also generally prometheus has safeguards to try and avoid oom, doesn't always work [12:59:04] 10Operations: wb_terms has been removed - https://phabricator.wikimedia.org/T265137 (10toan) [13:04:30] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10Xqt) I tried to get a confirmation mail again and got it. It worked as expected now. Thanks a lot. 
[13:06:09] (03PS1) 10Filippo Giunchedi: pontoon: add ssl symlink unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/633179 [13:06:29] (03CR) 10jerkins-bot: [V: 04-1] pontoon: add ssl symlink unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/633179 (owner: 10Filippo Giunchedi) [13:06:51] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Worked required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) p:05Triage→03Medium [13:07:17] (03PS2) 10Filippo Giunchedi: pontoon: add ssl symlink unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/633179 [13:07:43] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Worked required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [13:07:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [13:07:48] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [13:07:50] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) [13:09:12] (03CR) 10Kormat: [C: 03+1] pontoon: add ssl symlink unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/633179 (owner: 10Filippo Giunchedi) [13:09:21] 10Operations, 10Puppet, 10User-jbond: Update puppet infrastructure latest 5.5 version - https://phabricator.wikimedia.org/T265139 (10jbond) [13:10:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add ssl symlink unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/633179 (owner: 10Filippo Giunchedi) [13:12:14] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Worked required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [13:12:26] !log 
jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [13:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:18] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632719 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:19:25] (03Merged) 10jenkins-bot: wikifeeds: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632719 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [13:20:39] (03PS1) 10Gehel: wdqs: don't fail if journal does not exist before data reload [cookbooks] - 10https://gerrit.wikimedia.org/r/633182 (https://phabricator.wikimedia.org/T255399) [13:20:51] (03PS2) 10Gehel: wdqs: don't fail if journal does not exist before data reload [cookbooks] - 10https://gerrit.wikimedia.org/r/633182 (https://phabricator.wikimedia.org/T255399) [13:23:35] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[13:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] (03CR) 10DCausse: [C: 03+1] wdqs: don't fail if journal does not exist before data reload [cookbooks] - 10https://gerrit.wikimedia.org/r/633182 (https://phabricator.wikimedia.org/T255399) (owner: 10Gehel) [13:24:07] (03PS1) 10Andrew Bogott: nova-fullstack monitoring: turn on debug logging [puppet] - 10https://gerrit.wikimedia.org/r/633183 (https://phabricator.wikimedia.org/T265140) [13:25:03] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack monitoring: turn on debug logging [puppet] - 10https://gerrit.wikimedia.org/r/633183 (https://phabricator.wikimedia.org/T265140) (owner: 10Andrew Bogott) [13:25:44] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10herron) >>! In T264504#6532332, @Xqt wrote: > I tried to get a confirmation mail again and got it. It worked as expected now. Thanks a lot.... [13:27:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10Cmjohnson) 05Open→03Invalid This is an old ticket, @jgreen just made a new task for the same server. Killing this off [13:29:15] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:44] (03PS1) 10Kormat: puppetmaster: Make self-master-post-receive more general. 
[puppet] - 10https://gerrit.wikimedia.org/r/633184 [13:30:43] (03CR) 10Gehel: [C: 03+2] wdqs: don't fail if journal does not exist before data reload [cookbooks] - 10https://gerrit.wikimedia.org/r/633182 (https://phabricator.wikimedia.org/T255399) (owner: 10Gehel) [13:31:44] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:50] dcausse: ^ restarted, we'll see on Monday how it went! [13:32:55] gehel: thanks! [13:33:52] (03CR) 10Filippo Giunchedi: [C: 03+1] puppetmaster: Make self-master-post-receive more general. [puppet] - 10https://gerrit.wikimedia.org/r/633184 (owner: 10Kormat) [13:34:20] 10Operations, 10serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) After a lot of fist shaking and head scratching I think I've found a workable solution, to the problem that PHP build depends on ICU 63 (for intl) and indirectly... [13:34:40] (03CR) 10Kormat: [C: 03+2] puppetmaster: Make self-master-post-receive more general. [puppet] - 10https://gerrit.wikimedia.org/r/633184 (owner: 10Kormat) [13:35:11] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Worked required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [13:35:37] 10Operations, 10Mail: exim should log the reason for defer with disconnect after HELO/EHLO - https://phabricator.wikimedia.org/T265142 (10herron) p:05Triage→03Medium [13:36:32] 10Operations, 10Mail: exim should log the reason for defer with disconnect after HELO/EHLO - https://phabricator.wikimedia.org/T265142 (10herron) This would have been helpful in troubleshooting T264504 [13:37:47] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10herron) 05Open→03Resolved a:03herron I think we're in good shape here now. 
Related exim logging improvements can be coordinated via... [13:40:07] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: in puppet 6 some core types have been moved to external modules. check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) [13:43:38] (03PS1) 10Elukey: Fix typos for Hadoop test cluster's hostnames [puppet] - 10https://gerrit.wikimedia.org/r/633187 (https://phabricator.wikimedia.org/T255139) [13:44:10] (03CR) 10Elukey: [C: 03+2] Fix typos for Hadoop test cluster's hostnames [puppet] - 10https://gerrit.wikimedia.org/r/633187 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [13:45:52] !log helm rollback push-notification in eqiad to revision 8 [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:24] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:55] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Reedy) [14:02:07] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) > will move to puppet6 untill at least bullseye Its worth noting that bullseye currently has puppet 5.5.19 (with sid on 5.5.21) its not clear if bullsey... [14:08:04] (03CR) 10JMeybohm: [C: 03+2] zotero: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632720 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [14:08:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Hi @ayounsi, can you help me? 
I have some more questions: * What is the field that we want to extract the AS name for? I see as_src, as_d... [14:12:35] (03Merged) 10jenkins-bot: zotero: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/632720 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [14:17:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:19] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:28:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) (owner: 10Muehlenhoff) [14:32:34] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:06] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10Joe) Adding some notes after yesterday's meeting: - the current script is using `sqlitedict` right now, and t... [14:36:19] (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) (owner: 10Muehlenhoff) [14:37:25] (03PS1) 10Klausman: amd_rocm: Ensure linux-headers-amd64 is installed [puppet] - 10https://gerrit.wikimedia.org/r/633194 [14:38:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) > * What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst? Ideally all of them, but a... [14:41:05] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [14:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) (owner: 10Muehlenhoff) [15:13:16] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) With [[ https://github.com/rakyll/hey | hey ]] on cp3052 (Varnish 5.1.3-1wm15) and cp3054 (6.0.6-1wm1) I obtained the following two late... 
[15:15:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={federate-ops,prometheus} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:16:39] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:11] PROBLEM - ping-offload grafana alert on alert1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [15:21:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:42] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:24:20] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:25:38] (03PS1) 10JMeybohm: service_proxy: add node.js keepalive to push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/633199 [15:25:42] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is 
CRITICAL: instance=127.0.0.1 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [15:32:50] hmm prometheus1004 prometheus@ops[9562]: fatal error: runtime: out of memory [15:33:14] :( sad_trombone.wav [15:33:16] restarted by systemd [15:34:43] yeah I'm guessing an heavy query [15:35:40] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=smartmon.prom instance=relforge1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:37:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:39:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:40:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [15:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:24] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:50] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [15:42:53] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `testvm3001.esams.wmnet` - testvm3001.esams.wmnet (**WARN**) - **Failed do... 
[15:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:56] mutante: I'm in a meeting but I can look in a bit at the failrue [15:43:58] *failure [15:46:10] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:47:34] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [15:48:08] volans|off: thank you, alright. this was a VM that existed in ganeti but was not in puppetdb. maybe that is causing something [15:48:33] did not have that issue last night removing one like it.. but the difference was that was in site.pp for a short time [15:48:34] RECOVERY - ping-offload grafana alert on alert1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. 
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [15:48:55] i'll try now with the 3rd and last one [15:49:07] mutante: nah that's not the issue, as you can see in the phab update [15:49:15] the dns step failed, that's why I want to look at it [15:49:23] the icinga downtime is just a warning and not a factor [15:50:19] ack, actual failure to run dns cookbook [15:54:54] mutante: I think it was a failed netbox API call, I want to add retry logic that depends on updating pynetbox [15:55:04] with the next host it should show you the diff for both [15:55:09] in the DNS part [15:55:13] volans|off: should i try repeating the exact command one more time? [15:55:18] nah, no need [15:55:20] ok [15:55:25] in case you didn't have another to decom [15:55:33] I would have asked you to run the sre.dns.netbox cookbook [15:55:37] to sync the dns part [15:55:37] i have one more. i am just not sure if this one is in a zombie state now [15:55:45] checks netbox [15:55:55] shouldn't matter [15:55:59] ok [15:56:07] but feel free to ask if it's in some very weird state [15:56:24] ack. netbox looks ok. 
doing the last one [15:56:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] (03PS1) 10Ayounsi: Prioritize SG-IX [homer/public] - 10https://gerrit.wikimedia.org/r/633200 (https://phabricator.wikimedia.org/T260991) [15:58:58] (03PS2) 10Dzahn: install_server: remove testvm[345]001 [puppet] - 10https://gerrit.wikimedia.org/r/632590 [16:06:02] volans|off: it's showing me the DNS diff and that has both VMs, the previous one and this one and just said "done" now [16:06:10] so seems all good [16:06:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] 10Operations: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `testvm5001.eqsin.wmnet` - testvm5001.eqsin.wmnet (**WARN**) - **Failed do... [16:06:57] mutante: ack, as expected [16:07:54] yep, thanks [16:08:04] (03PS1) 10Dave Pifke: [WIP] Start puppetizing WebPageTest [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) [16:10:23] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [16:10:53] 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10lexnasser) @elukey Yep, that's the correct email. I also confirm that I'm now able to access Turnilo and Stat1007. Thanks for your help! 
[16:11:59] (03CR) 10Dzahn: [C: 03+2] "all 3 have been decom'ed with cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/632590 (owner: 10Dzahn) [16:12:24] 10Operations, 10User-jbond: Proposal: create a framework to build containerized incident management protects - https://phabricator.wikimedia.org/T265153 (10jbond) p:05Triage→03Low [16:14:49] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis we didn't remove fetches against `/api/rest_v1/page/random... [16:18:40] 10Operations, 10User-jbond: Proposal: create a framework to build containerized incident management protects - https://phabricator.wikimedia.org/T265153 (10jbond) [16:22:50] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) >>! In T264881#6532868, @Tsevener wrote: > @CDanis we didn't remove... [16:35:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25811/" [puppet] - 10https://gerrit.wikimedia.org/r/633020 (owner: 10Dzahn) [16:36:05] (03PS3) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s [puppet] - 10https://gerrit.wikimedia.org/r/633017 [16:36:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:59] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "all of these are on ci::master, so here's the noop:" [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [16:40:37] (03CR) 10Dzahn: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [16:42:16] (03CR) 10Dzahn: "noop confirmed on contint1001" [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [16:52:12] (03PS1) 10Gergő Tisza: Enable session-ip log channel on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) [16:54:20] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) > > What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst? > Ideally all of them, but... [17:03:44] (03CR) 10Dzahn: "noop confirmed wtp2015" [puppet] - 10https://gerrit.wikimedia.org/r/633020 (owner: 10Dzahn) [17:04:33] (03PS2) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) [17:10:49] (03CR) 10Urbanecm: [C: 03+1] "LGTM, but perhaps this should be everywhere, at least until T264369 is resolved?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [17:15:10] (03CR) 10Elukey: [C: 03+1] amd_rocm: Ensure linux-headers-amd64 is installed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/633194 (owner: 10Klausman) [17:15:23] (03CR) 10Brennen Bearnes: [C: 03+1] local_dev::docker_publish: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/633027 (owner: 10Dzahn) [17:16:39] (03CR) 10Dzahn: [C: 03+2] "thanks, that was quick 😊" [puppet] - 10https://gerrit.wikimedia.org/r/633027 (owner: 10Dzahn) [17:18:11] (03CR) 10Gergő Tisza: "Eventually, yeah. I want to make sure it's not overloading Logstash or Kask." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [17:20:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) > Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP w... [17:24:30] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 5: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [17:29:29] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10Tsevener) @CDanis Thanks! That response is working fine in my testing, feel... [17:32:58] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) Great! Thanks @Tsevener ! 
[17:44:37] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:31] (03PS8) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [17:51:05] (03CR) 10Dzahn: [C: 03+2] toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [17:58:02] (03CR) 10Ori.livneh: "Thanks for this" [puppet] - 10https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [18:01:16] (03CR) 10Dzahn: "thank you Ori. looks like I failed to even add you to reviewers. that was an accident." [puppet] - 10https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [18:05:35] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) no sure why but some host take a looong time ` Completed SYN Stealth Scan against 208.80.153.45 in 50099.78s (63 hosts left) Completed SYN Stealth... [18:08:14] (03PS1) 10Andrew Bogott: Cloud puppetmasters: Rename an argument in Profile::Pupetmaster::Backend [puppet] - 10https://gerrit.wikimedia.org/r/633215 [18:09:19] (03CR) 10Andrew Bogott: [C: 03+2] Cloud puppetmasters: Rename an argument in Profile::Pupetmaster::Backend [puppet] - 10https://gerrit.wikimedia.org/r/633215 (owner: 10Andrew Bogott) [18:14:17] (03CR) 10Dzahn: "confirmed no issue on tools-sgeexec-0901. thank you as well." [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [18:27:21] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) It looks good to me! I am copying it over to the blog for publication next week. 
(Tuesday 13 Oct) [18:44:08] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [18:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:56] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10nettrom_WMF) @kostajh : Thanks for picking this up and pinging me about it. I think we should switch off EditorJourney since... [18:52:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:56:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:47] (03PS1) 10Herron: logstash: add field checks to filter throttle [puppet] - 10https://gerrit.wikimedia.org/r/633224 [19:02:51] (03PS1) 10Dzahn: add testvm1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/633225 [19:04:01] (03CR) 10Dzahn: [C: 03+2] add testvm1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/633225 (owner: 10Dzahn) [19:06:40] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [19:08:17] (03PS1) 10Dzahn: site: add testvm1001 with appserver role for a test [puppet] - 10https://gerrit.wikimedia.org/r/633226 [19:10:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:32] (03PS2) 10Dzahn: site/DHCP: add testvm1001 with appserver role for a test [puppet] - 10https://gerrit.wikimedia.org/r/633226 [19:14:20] 
(03PS1) 10Razzi: turnilo: switch from nginx to envoy for tls termination [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) [19:22:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:48] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/633229 [19:23:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:25:01] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/633229 [19:25:11] (03CR) 10Razzi: "I figured it'd be easiest to make this change for a single host, iterate on that, and then roll it out to the others. I can see there's so" [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [19:34:16] 10Operations, 10Machine Learning Platform, 10SRE-Access-Requests: Requesting adding to ores-admin for Ladsgroup - https://phabricator.wikimedia.org/T265172 (10Ladsgroup) [19:46:04] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in exim - https://phabricator.wikimedia.org/T265175 (10sbassett) [19:46:26] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in exim - https://phabricator.wikimedia.org/T265175 (10sbassett) [19:46:28] (03PS3) 10CDanis: VCL: temp. 
ratelimit iOS app fetches of random page summary [puppet] - 10https://gerrit.wikimedia.org/r/633229 (https://phabricator.wikimedia.org/T264881) [19:47:06] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in exim - https://phabricator.wikimedia.org/T265175 (10sbassett) [19:49:01] (03CR) 10Hashar: [C: 03+1] Enable session-ip log channel on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [19:49:07] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in exim - https://phabricator.wikimedia.org/T265175 (10sbassett) [19:49:51] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in exim - https://phabricator.wikimedia.org/T265175 (10Dzahn) Hi @sbassett security@ isn't in exim anymore nowadays. It's in Google, you'll have to ask OIT via Zendesk please: ` [mx1001:~] $ sudo exim4 -bt security@wikimedia.org security... [19:50:51] 10Operations, 10Security-Team: Remove Chase Pettet from security@ alias in Google - https://phabricator.wikimedia.org/T265175 (10sbassett) [19:52:07] (03CR) 10Hashar: [C: 03+1] "If we don't want to roll it this Friday we can have:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [19:53:38] (03CR) 10CDanis: "PCC looks correct: https://puppet-compiler.wmflabs.org/compiler1001/25816/" [puppet] - 10https://gerrit.wikimedia.org/r/633229 (https://phabricator.wikimedia.org/T264881) (owner: 10CDanis) [19:54:05] (03CR) 10Gergő Tisza: "Per the recent discussion in #mediawiki_security, I think it's OK to roll this out today." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [20:08:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:09:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:15:50] (03CR) 10Hashar: [C: 03+1] "Looks good to me :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [20:16:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:18:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:42:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:44:04] (03CR) 10RLazarus: [C: 03+1] VCL: temp. 
ratelimit iOS app fetches of random page summary [puppet] - 10https://gerrit.wikimedia.org/r/633229 (https://phabricator.wikimedia.org/T264881) (owner: 10CDanis) [20:46:30] (03CR) 10CDanis: [C: 03+2] "0 tests failed, 0 tests skipped, 22 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/633229 (https://phabricator.wikimedia.org/T264881) (owner: 10CDanis) [20:55:28] (03PS1) 10Gergő Tisza: Log IP/device changes within the same session [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633252 (https://phabricator.wikimedia.org/T264799) [20:55:55] (03PS1) 10Gergő Tisza: Log IP/device changes within the same session [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633253 (https://phabricator.wikimedia.org/T264799) [20:56:53] (03PS1) 10Gergő Tisza: SessionManager: Always log IP/UA in session-ip [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633254 (https://phabricator.wikimedia.org/T264799) [20:57:53] (03PS1) 10Gergő Tisza: SessionManager: Always log IP/UA in session-ip [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633255 (https://phabricator.wikimedia.org/T264799) [20:58:54] (03CR) 10Gergő Tisza: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633252 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [20:59:00] (03CR) 10Gergő Tisza: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633253 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [20:59:25] (03CR) 10Gergő Tisza: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633254 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [20:59:42] (03CR) 10Gergő Tisza: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633255 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [21:06:15] (03CR) 10Dzahn: [C: 03+2] site/DHCP: add testvm1001 with appserver role for a test 
[puppet] - 10https://gerrit.wikimedia.org/r/633226 (owner: 10Dzahn) [21:11:46] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10wiki_willy) a:05wiki_willy→03Cmjohnson PDUs were shipped out today and should arrive next week. Assigning back to @Cmjohnson to complete... [21:28:43] (03Merged) 10jenkins-bot: Log IP/device changes within the same session [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633252 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [21:32:13] (03Merged) 10jenkins-bot: Log IP/device changes within the same session [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633253 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [21:33:15] (03Merged) 10jenkins-bot: SessionManager: Always log IP/UA in session-ip [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/633254 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [21:33:18] (03Merged) 10jenkins-bot: SessionManager: Always log IP/UA in session-ip [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/633255 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [21:53:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:53:41] !log [urbanecm@mwmaint2001 ~]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=dewiki --userlist users.txt # users.txt contains Almeida # T263935 [21:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:47] T263935: Local account not attached after SUL-Finalization - https://phabricator.wikimedia.org/T263935 [21:54:59] tgr_: you seem to have unsynced patches? 
[21:55:50] (03PS1) 10Dzahn: partman: add testvm to use standard flat/virtual recipe [puppet] - 10https://gerrit.wikimedia.org/r/633269 [21:56:05] Urbanecm: do I? I haven't pulled anything to the deploy host yet [21:56:26] tgr_: yes, but you merged patches [21:57:33] (I don't plan to roll anything on a Friday evening, I just noticed that when I logged in, and saw that some patches were merged 20 minutes ago, and not synced, so I pinged you) [21:58:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:58:30] yeah, I went for lunch while CI was working [21:59:19] i see :). [22:01:19] !log rolling out T264799#6533622 [22:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:24] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [22:09:43] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.10/includes/: Backport: [[gerrit:633252|Log IP/device changes within the same session (T264799)]] & [[gerrit:633254|SessionManager: Always log IP/UA in session-ip]] (duration: 01m 06s) [22:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:50] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [22:12:50] that caused a bit of an error spike. I should have synced file by file, probably.
[22:13:19] (03CR) 10Gergő Tisza: [C: 03+2] Enable session-ip log channel on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [22:14:03] (03Merged) 10jenkins-bot: Enable session-ip log channel on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633210 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [22:14:07] (03CR) 10Dzahn: [C: 03+2] partman: add testvm to use standard flat/virtual recipe [puppet] - 10https://gerrit.wikimedia.org/r/633269 (owner: 10Dzahn) [22:20:13] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633210|Enable session-ip log channel on group0 (T264799)]] (duration: 00m 59s) [22:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:20] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [22:23:28] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.11/includes/: Backport: [[gerrit:633252|Log IP/device changes within the same session (T264799)]] & [[gerrit:633254|SessionManager: Always log IP/UA in session-ip]] (duration: 01m 04s) [22:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:38] (03PS2) 10Dzahn: wmcs::instance: remove diamond removal remnants [puppet] - 10https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993) [22:26:58] (03PS1) 10Gergő Tisza: Enable session-ip log channel on group1, except Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633271 (https://phabricator.wikimedia.org/T264799) [22:34:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:35:55] RECOVERY - Prometheus jobs reduced 
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:30] (03CR) 10Gergő Tisza: [C: 03+2] Enable session-ip log channel on group1, except Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633271 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [22:46:37] (03Merged) 10jenkins-bot: Enable session-ip log channel on group1, except Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633271 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [22:52:15] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633271|Enable session-ip log channel on group1, except Commons/Wikidata (T264799)]] (duration: 00m 57s) [22:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:22] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [22:57:19] (03PS1) 10Gergő Tisza: Enable session-ip log channel on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633272 (https://phabricator.wikimedia.org/T264799) [23:02:11] PROBLEM - PHP7 rendering on testvm1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:03:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:03:38] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:16] ACKNOWLEDGEMENT - PHP7 rendering on testvm1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn test https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[23:05:01] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Dzahn) maps2010 is reported as down since about 3 days [23:05:24] (03PS1) 10Urbanecm: Allow testwiki bureaucrats to grant and revoke (transwiki) importer rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633273 [23:06:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:06:08] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:06:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:06:12] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:35] !log maps2010 is down since almost 3 days - unhandled crit alert but nothing in SAL or tickets [23:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:35] (03CR) 10DannyS712: Allow testwiki bureaucrats to grant and revoke (transwiki) importer rights (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633273 (owner: 10Urbanecm) [23:10:15] (03PS2) 10Urbanecm: [testwiki, test2wiki] Allow bureaucrats to grant import rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633273 [23:11:24] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) 
rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Dzahn) Is there a ticket for moving these into production? [23:11:57] (03CR) 10Gergő Tisza: [C: 03+2] Enable session-ip log channel on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633272 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [23:13:06] (03Merged) 10jenkins-bot: Enable session-ip log channel on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633272 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [23:13:49] !log maps2010 is down since almost 3 days - unhandled crit alert but nothing in SAL and only related ticket says resolved - powercycling it - boots normal but doesn't have a prod role (T260271) [23:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:55] T260271: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 [23:16:17] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:17:49] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:51] 10Operations, 10serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) There was already T245757 with dependency tickets. 
[23:25:12] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633272|Enable session-ip log channel on Commons (T264799)]] (duration: 00m 59s) [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:18] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [23:26:07] 10Operations, 10serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) As well as T250515 for the PHP packages. [23:31:13] (03PS1) 10Gergő Tisza: Enable session-ip log channel on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633274 (https://phabricator.wikimedia.org/T264799) [23:31:15] 10Operations, 10serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) > Fix all of our puppet code for MediaWiki for incompatibilities with buster I applied the puppet role on a buster test VM in eqiad and the following packages are missing:... 
[23:40:46] (03CR) 10Gergő Tisza: [C: 03+2] Enable session-ip log channel on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633274 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [23:41:39] (03Merged) 10jenkins-bot: Enable session-ip log channel on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633274 (https://phabricator.wikimedia.org/T264799) (owner: 10Gergő Tisza) [23:44:58] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:633274|Enable session-ip log channel on Wikidata (T264799)]] (duration: 00m 59s) [23:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:04] T264799: Log when a request with the same user session comes from a different IP - https://phabricator.wikimedia.org/T264799 [23:46:06] (03PS1) 10Dzahn: mediawiki: replace font package ttf-alee with fonts-alee [puppet] - 10https://gerrit.wikimedia.org/r/633275 (https://phabricator.wikimedia.org/T264991) [23:49:35] (03PS1) 10Gergő Tisza: Enable session-ip log channel on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633276 (https://phabricator.wikimedia.org/T264799) [23:51:30] (03PS2) 10Dzahn: mediawiki: replace font package ttf-alee with fonts-alee [puppet] - 10https://gerrit.wikimedia.org/r/633275 (https://phabricator.wikimedia.org/T264991) [23:51:39] 10Operations, 10serviceops, 10Patch-For-Review: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Legoktm) >>! In T264991#6533968, @Dzahn wrote: > - ploticus {T253377} > - php7.2-opcache > - php7.2-common These should be php7.3 now. [23:53:53] 10Operations, 10serviceops, 10Patch-For-Review: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Dzahn) [23:54:32] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Peachey88)