[00:00:59] (PS13) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[00:03:23] (PS14) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[00:08:47] (PS15) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[00:11:59] Operations, Cloud-Services, procurement, cloud-services-team (Kanban): ssl renewal: *.wmflabs.org expires 2019-11-16 - https://phabricator.wikimedia.org/T233176 (Bstorm)
[00:12:03] (PS16) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[00:13:53] Operations, Cloud-Services, procurement, cloud-services-team (Kanban): ssl renewal: *.wmflabs.org expires 2019-11-16 - https://phabricator.wikimedia.org/T233176 (Bstorm) Open→Declined This might need a procurement so, I'm going to rebuild the task.
[00:17:01] (CR) Cwhite: "https://puppet-compiler.wmflabs.org/compiler1001/18358/" [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[00:21:07] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:21:21] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:22:41] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:22:55] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:43:21] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:32:03] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.w
[01:32:03] /Mobileapps_%28service%29
[01:33:33] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:39:51] (CR) Andrew Bogott: [C: +1] "Scrapping it sounds right to me."
[puppet] - https://gerrit.wikimedia.org/r/537536 (owner: Dzahn)
[02:04:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:07:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 40 probes of 460 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:12:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 460 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[02:40:41] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 88091344 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:19] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 135176 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:21:34] (PS1) CRusnov: profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576
[03:22:11] (CR) jerkins-bot: [V: -1] profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576 (owner: CRusnov)
[03:26:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:31:11] Operations, Traffic, Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (matmarex) Has anyone seen this issue again in the past two weeks?
If not, the VE patch might have fixed it…
[03:31:39] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:37:38] (PS2) Mathew.onipe: query_service: change wdqs module to query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[03:41:06] (Abandoned) Mathew.onipe: Add SDQS module [puppet] - https://gerrit.wikimedia.org/r/535844 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[03:42:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:53:48] (PS2) CRusnov: profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183)
[03:54:42] (CR) jerkins-bot: [V: -1] profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[03:55:18] (PS6) CRusnov: Initial support for custom scripts [software/netbox-reports] - https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449)
[03:55:30] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/resources/src/mediawiki.util/: 0333729e, ccfe88241 (duration: 01m 07s)
[03:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:25] (PS3) CRusnov: profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183)
[04:07:02] (CR) jerkins-bot: [V: -1] profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576
(https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[04:26:51] Operations, netbox, observability: netbox / netmon1002: netbox report related service units failed - https://phabricator.wikimedia.org/T224517 (crusnov) Open→Resolved netmon1002 is cleaned up now and should not be alerting on these basis anymore.
[04:45:50] Operations, Traffic, Wikimedia-General-or-Unknown, netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (Marostegui) Open→Resolved Going to close this for now. Feel free to reopen if needed.
[04:58:13] Operations, SRE-Access-Requests, Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (Urbanecm) >>! In T231616#5500945, @Nuria wrote: > @MMiller_WMF and @Urbanecm > > I want to clarify that we do not grant access to private data to voluntee...
[05:02:38] Operations, ops-codfw, DBA: db2127 memory issues - https://phabricator.wikimedia.org/T233184 (Marostegui)
[05:03:07] Operations, ops-codfw, DBA: db2127 memory issues - https://phabricator.wikimedia.org/T233184 (Marostegui) p:Triage→Normal We are leaving this task opened for a few days to see if the errors get back.
[05:03:41] !log Start MySQL on db2127 T233184
[05:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:45] T233184: db2127 memory issues - https://phabricator.wikimedia.org/T233184
[05:10:25] Operations, SRE-Access-Requests, Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (Nuria) Sorry this is disappointing but given our very limited resources we really cannot support ad-hoc data access for community members, the best way we h...
[05:20:20] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:21:15] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:22:29] Operations, DBA: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui)
[05:23:34] Operations, DBA: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui) p:Triage→Normal
[05:25:45] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/537581 (https://phabricator.wikimedia.org/T233186)
[05:28:36] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:29:34] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/537581 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:29:54] Operations, DBA, Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui)
[05:30:22] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/537581 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:30:40] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/537581 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:31:47] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2055 from config T233186 (duration: 01m 06s)
[05:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:51] T233186: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186
[05:33:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2055 from config T233186 (duration: 01m
04s)
[05:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:33] (PS1) Marostegui: db2055: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/537582 (https://phabricator.wikimedia.org/T233186)
[05:45:35] (CR) Marostegui: [C: +2] db2055: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/537582 (https://phabricator.wikimedia.org/T233186) (owner: Marostegui)
[05:46:03] Operations, DBA, Patch-For-Review: Decommission db2055.codfw.wmnet - https://phabricator.wikimedia.org/T233186 (Marostegui)
[05:47:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool host after onsite checks T233184', diff saved to https://phabricator.wikimedia.org/P9123 and previous config saved to /var/cache/conftool/dbconfig/20190918-054755-marostegui.json
[05:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:59] T233184: db2127 memory issues - https://phabricator.wikimedia.org/T233184
[05:49:23] (PS11) DannyS712: Fix typos in code [puppet] - https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491)
[05:49:48] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:54:45] Operations, SRE-Access-Requests, Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (Urbanecm) >>! In T231616#5502047, @Nuria wrote: > Sorry this is disappointing but given our very limited resources we really cannot support ad-hoc data acce...
[05:58:05] !log Deploy schema change on db2097:3316 - T233135
[05:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:08] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[06:00:58] (PS3) Giuseppe Lavagetto: envoy: add command-line options to be used for zone-aware routing [puppet] - https://gerrit.wikimedia.org/r/536150
[06:04:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2089:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P9124 and previous config saved to /var/cache/conftool/dbconfig/20190918-060401-marostegui.json
[06:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:37] (CR) Giuseppe Lavagetto: [C: +2] envoy: add command-line options to be used for zone-aware routing [puppet] - https://gerrit.wikimedia.org/r/536150 (owner: Giuseppe Lavagetto)
[06:17:17] (PS1) Giuseppe Lavagetto: envoy: add zone-aware routing variables in jessie too [puppet] - https://gerrit.wikimedia.org/r/537583
[06:17:51] (CR) Giuseppe Lavagetto: [C: +2] envoy: add zone-aware routing variables in jessie too [puppet] - https://gerrit.wikimedia.org/r/537583 (owner: Giuseppe Lavagetto)
[06:29:49] (PS1) Giuseppe Lavagetto: envoy: fix typo in restarter script [puppet] - https://gerrit.wikimedia.org/r/537584
[06:30:25] (CR) Giuseppe Lavagetto: [V: +2 C: +2] envoy: fix typo in restarter script [puppet] - https://gerrit.wikimedia.org/r/537584 (owner: Giuseppe Lavagetto)
[06:43:51] !log reimaging restbase2011 to stretch T224553
[06:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:54] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[06:48:13] (PS2) Muehlenhoff: restbase2011: Add JBOD hiera config for upcoming reimage [puppet] - https://gerrit.wikimedia.org/r/536566
[06:56:44]
(PS2) ArielGlenn: one-off for generating some page meta history files [dumps] - https://gerrit.wikimedia.org/r/537546
[07:00:16] (CR) Muehlenhoff: [C: +2] restbase2011: Add JBOD hiera config for upcoming reimage [puppet] - https://gerrit.wikimedia.org/r/536566 (owner: Muehlenhoff)
[07:03:12] (CR) Muehlenhoff: "Yeah, let's drop the whole labpuppetmaster alias." [puppet] - https://gerrit.wikimedia.org/r/537536 (owner: Dzahn)
[07:05:51] (PS1) Elukey: Add hadoop overrides for analytics1034 [puppet] - https://gerrit.wikimedia.org/r/537585
[07:06:05] (CR) Elukey: [C: +2] Add hadoop overrides for analytics1034 [puppet] - https://gerrit.wikimedia.org/r/537585 (owner: Elukey)
[07:14:34] Operations, SRE-Access-Requests: Requesting access to Ops Group for papaul@ - https://phabricator.wikimedia.org/T233189 (wiki_willy)
[07:17:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:23] (PS3) ArielGlenn: one-off for generating some page meta history files [dumps] - https://gerrit.wikimedia.org/r/537546
[07:19:32] (PS1) Awight: Enable FileImport source wiki editing [mediawiki-config] - https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851)
[07:19:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:10] (PS15) Urbanecm: Initial configuration for hiwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155)
[07:28:43] (CR) Urbanecm: "> Patch Set 14: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: Urbanecm)
[07:29:26] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/537534
(https://phabricator.wikimedia.org/T233137) (owner: Zoranzoki21)
[07:31:19] (CR) Filippo Giunchedi: [C: -1] "See inline, LGTM otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[07:32:02] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:32:14] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:16] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:33:22] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:33:32] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:33:34] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:33:45] (CR) Filippo Giunchedi: [C: +1] profile: use prometheus for logstash alerting [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[07:34:54] (CR) Urbanecm: [C: -1] "You need to re-apply on VariantSettings.php."
[mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[07:36:01] (Abandoned) Urbanecm: Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: Zoranzoki21)
[07:36:08] (PS1) Elukey: Add hadoop overrides for analytics1038 [puppet] - https://gerrit.wikimedia.org/r/537587
[07:39:52] Operations, ops-eqiad, DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (fgiunchedi) That's correct, for PDUs we've already upgraded/replaced no updates needed, although for upcoming PDU replacements there will be updates needed
[07:43:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0)
[07:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:18] \o/
[07:43:34] (CR) Elukey: [C: +2] Add hadoop overrides for analytics1038 [puppet] - https://gerrit.wikimedia.org/r/537587 (owner: Elukey)
[07:44:25] (CR) Jcrespo: [C: +2] install_server: Update partman recipe to set / on last disks [puppet] - https://gerrit.wikimedia.org/r/537336 (https://phabricator.wikimedia.org/T229209) (owner: Jcrespo)
[07:44:36] (PS3) Jcrespo: install_server: Update partman recipe to set / on last disks [puppet] - https://gerrit.wikimedia.org/r/537336 (https://phabricator.wikimedia.org/T229209)
[07:50:48] Operations, Packaging, serviceops, CPT Initiatives (Session Management Service (CDP2)): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (jijiki) @holger.knust What is the status of this?
Just checking if I can help
[07:53:25] (PS1) Vgutierrez: install_server: Enable OCSP stapling and SSL monitoring [puppet] - https://gerrit.wikimedia.org/r/537593 (https://phabricator.wikimedia.org/T232988)
[07:56:48] Operations, Analytics, User-Elukey: setup/install eqiad kerbos node WMF5173 - https://phabricator.wikimedia.org/T233141 (elukey) a:elukey→RobH Thanks a lot! Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough
[07:57:19] Operations, Analytics, User-Elukey: setup/install codfw kerbos node WMF6577 - https://phabricator.wikimedia.org/T233142 (elukey) a:elukey→RobH Thanks a lot! Hostnames: krb1001 krb2001 (already updated the naming conventions in wikitech) Internal subnet, no Analytics VLAN raid1 is good enough
[08:03:04] Operations, serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (jijiki)
[08:03:33] (PS1) Vgutierrez: ATS: Ensure that ATS service gets enabled [puppet] - https://gerrit.wikimedia.org/r/537594
[08:04:13] Operations, serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (jijiki) @thcipriani is it ok if we update php7 to 7.2.22 on deploy* servers? Do you know if there are any dependencies ?
[08:04:57] Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (MoritzMuehlenhoff)
[08:05:16] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[08:05:28] Operations, serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (jijiki)
[08:05:31] Operations, serviceops: Remove PHP 7.0 from production application servers - https://phabricator.wikimedia.org/T220600 (jijiki)
[08:06:07] (PS2) Vgutierrez: ATS: Ensure that ATS service gets enabled [puppet] - https://gerrit.wikimedia.org/r/537594
[08:09:43] (Abandoned) Vgutierrez: ATS: Ensure that ATS service gets enabled [puppet] - https://gerrit.wikimedia.org/r/537594 (owner: Vgutierrez)
[08:14:19] (CR) Filippo Giunchedi: "See comments inline, overall looks good" (7 comments) [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/536376 (owner: Cwhite)
[08:16:45] (CR) Volans: [C: -1] "I'd rather use dnspython in generateRRs.py instead of re-inventing the wheel of DNS zone file logic." [puppet] - https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[08:20:51] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (fgiunchedi) For additonal context, the UNKNOWN / phase monitoring for new PDUs is tracked here: {T229101} and the reason AFAICT is the SNMP OID change from sentry3 -> sen...
[08:21:53] Operations, ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (jcrespo) Resolved→Open Now instead of a failed disks, I can only see 23/24 disks, one disk of the second enclosure is gone. See: ` root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Dev...
[08:21:58] Operations, DBA, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[08:21:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 27.75 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:22:06] (PS1) Vgutierrez: ATS: Ensure that the non default instances can be enabled on systemd [puppet] - https://gerrit.wikimedia.org/r/537596
[08:22:11] !log bootstrap restbase2011-a -- T224553
[08:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:15] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553
[08:23:48] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[08:26:32] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[08:27:17] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[08:29:51] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 70.11 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:32:39] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (Marostegui)
[08:42:50] Operations, DBA, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on
cumin1001.eqiad.wmnet for hosts: ` ['backup1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-a...
[08:46:29] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/537596 (owner: Vgutierrez)
[08:49:03] (PS1) Filippo Giunchedi: thumbstats: use 'width' not 'pixels' [software] - https://gerrit.wikimedia.org/r/537600
[08:49:33] (CR) Filippo Giunchedi: [C: +2] "Old comments-only patch I didn't submit" [software] - https://gerrit.wikimedia.org/r/537600 (owner: Filippo Giunchedi)
[08:49:38] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201909), Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (hashar)
[08:51:41] (CR) Vgutierrez: [C: +2] ATS: Ensure that the non default instances can be enabled on systemd [puppet] - https://gerrit.wikimedia.org/r/537596 (owner: Vgutierrez)
[08:51:49] (PS2) Vgutierrez: ATS: Ensure that the non default instances can be enabled on systemd [puppet] - https://gerrit.wikimedia.org/r/537596
[08:57:20] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:57:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2089:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P9125 and previous config saved to /var/cache/conftool/dbconfig/20190918-085721-marostegui.json
[08:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:42] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime
[08:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:22] (CR) Mobrovac: [C: +1] sessionstore: configure cassandra for `local_dc` [deployment-charts] -
https://gerrit.wikimedia.org/r/537552 (https://phabricator.wikimedia.org/T229697) (owner: Eevans)
[08:59:42] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:27] Operations, DBA, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1001.eqiad.wmnet'] ` and were **ALL** successful.
[09:07:32] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:20:12] Operations, DBA, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) backup1001 was also setup, however there is still a missing disk: T232882#5502241. Separating enclosures into different logical drives is going to pay off earlier...
[09:21:10] Operations, MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (ItSpiderman) Hm, it cant be an array, key can only be generated from `TOTPKey::newFromRandom()` which cannot return an array.
Thats weird, im very interested in what that array contains
[09:23:02] (CR) Giuseppe Lavagetto: [C: +2] add generic interface to metrics gathering [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[09:23:57] (Merged) jenkins-bot: add generic interface to metrics gathering [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[09:25:00] Operations, Commons, MediaWiki-File-management, media-storage, and 2 others: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (fgiunchedi) I'm looking into the failure at P9064 and here's my theory with relevant code from `sync_container` `lang=python dstobjects = N...
[09:25:06] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201909), Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (hashar)
[09:25:33] (CR) Giuseppe Lavagetto: [C: +2] debian: extend dpkg-source diff ignore [software/service-checker] - https://gerrit.wikimedia.org/r/523121 (owner: Hashar)
[09:26:16] (CR) Giuseppe Lavagetto: [C: +2] Copy tests fixtures when building the package [software/service-checker] - https://gerrit.wikimedia.org/r/523119 (owner: Hashar)
[09:27:08] (Merged) jenkins-bot: Copy tests fixtures when building the package [software/service-checker] - https://gerrit.wikimedia.org/r/523119 (owner: Hashar)
[09:27:10] (Merged) jenkins-bot: debian: extend dpkg-source diff ignore [software/service-checker] - https://gerrit.wikimedia.org/r/523121 (owner: Hashar)
[09:28:33] (PS1) Filippo Giunchedi: swiftrepl: handle empty srcobjects when scanning for dstobjects [software] - https://gerrit.wikimedia.org/r/537610 (https://phabricator.wikimedia.org/T231110)
[09:28:46] PROBLEM - Check the Netbox report librenms for fail status.
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:29:44] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on elastic1018 is CRITICAL: 29 ge 4 Gehel server schedule for replacement https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[09:31:27] (PS4) ArielGlenn: one-off for generating some page meta history files [dumps] - https://gerrit.wikimedia.org/r/537546
[09:31:34] (PS1) Vgutierrez: ATS: Unmask trafficserver.service iff it's actually being used [puppet] - https://gerrit.wikimedia.org/r/537611
[09:32:08] Suddenly it seems like something hangs at nowiki… Toolbar does not load at CrappyOldEditor…
[09:33:07] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201909), Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (hashar) Hi @herron, others have pointed me to you for th...
[09:34:15] (CR) Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18362/" [puppet] - https://gerrit.wikimedia.org/r/537611 (owner: Vgutierrez)
[09:34:20] Its back
[09:34:30] Operations, Thumbor, serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (jijiki)
[09:34:41] Operations, Thumbor, serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (jijiki) p:Triage→Normal
[09:35:08] !log run swiftrepl eqiad -> codfw for transcoded containers
[09:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:32] Operations, serviceops: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (jijiki) @Dzahn We will be moving Thumbor to k8s T233196, we can repurpose the spare server for something else:)
[09:37:43] !log upgrading netmon* to PHP 7.2.22 T230024
[09:37:46] !log run swiftrepl eqiad -> codfw on all containers, no deletes
[09:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:47] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024
[09:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:57] Operations, ops-eqiad, Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['elastic1046.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[09:38:27] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (10hashar) [09:46:31] (03PS3) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [09:46:33] (03PS1) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [09:47:37] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [09:53:03] (03PS4) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [09:53:05] (03PS2) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [09:54:05] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [09:59:08] (03PS5) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [09:59:10] (03PS3) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [09:59:37] (03PS1) 10Marostegui: wmnet: Point m1-master to dbproxy1014 [dns] - 10https://gerrit.wikimedia.org/r/537617 (https://phabricator.wikimedia.org/T202367) [10:00:10] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [10:00:39] !log bootstrap restbase2011-b -- T224553 [10:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:43] T224553: Migrate remaining Restbase servers to Stretch - 
https://phabricator.wikimedia.org/T224553 [10:03:57] (03PS6) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [10:03:59] (03PS4) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [10:04:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [10:07:09] (03PS2) 10Marostegui: wmnet: Point m1-master to dbproxy1014 [dns] - 10https://gerrit.wikimedia.org/r/537617 (https://phabricator.wikimedia.org/T202367) [10:10:08] (03CR) 10Marostegui: [C: 03+2] wmnet: Point m1-master to dbproxy1014 [dns] - 10https://gerrit.wikimedia.org/r/537617 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [10:11:43] (03PS1) 10Jbond: prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 [10:13:51] (03CR) 10jerkins-bot: [V: 04-1] prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (owner: 10Jbond) [10:16:49] !log restarting postgres on puppetdb1002/2002 after updating permissions for replication user [10:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:41] !log force relocation of shards for eqiad search(chi) cluster [10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:09] (03PS2) 10Jbond: prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 [10:20:12] (03PS6) 10Giuseppe Lavagetto: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 (owner: 10Alexandros Kosiaris) [10:20:14] (03PS1) 10Giuseppe Lavagetto: Release 0.2.0 [software/service-checker] - 
10https://gerrit.wikimedia.org/r/537619 [10:21:00] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Resolved→03Open Thanks, I can ssh into production servers. However, I cannot access SWAP following [[ https://wikitech.wikimedia.org/wiki/... [10:22:10] (03CR) 10jerkins-bot: [V: 04-1] prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (owner: 10Jbond) [10:25:16] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Andrew-WMDE) [10:28:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 (owner: 10Alexandros Kosiaris) [10:28:31] 10Puppet: Analyses octocatalog-diff output - https://phabricator.wikimedia.org/T233203 (10jbond) [10:29:05] 10Operations, 10Commons, 10MediaWiki-File-management, 10media-storage, and 2 others: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (10fgiunchedi) Looks like it has worked, on a container with exactly 12k objects: ` # grep wikipedia-commons-local-deleted.0n 2019-09-18-repl-common... [10:29:23] (03Merged) 10jenkins-bot: Add logging support [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 (owner: 10Alexandros Kosiaris) [10:29:26] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:29:43] ok that is a new alert [10:30:20] (03PS3) 10Jbond: prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [10:31:04] 10Puppet, 10Patch-For-Review: Analyses octocatalog-diff output - https://phabricator.wikimedia.org/T233203 (10jbond) there are a few nodes which have diffes like the following. This is caused when an attribute has a value of `undef` ` lang=diff diff production/cp1078.eqiad.wmnet production/cp1078.eqiad.wmnet... [10:32:29] (03CR) 10jerkins-bot: [V: 04-1] prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [10:36:14] (03PS4) 10Jbond: prometheus - bastion: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [10:43:07] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1046.eqiad.wmnet'] ` Of which those **FAILED**: ` ['elastic1046.eqiad.wmnet'] ` [10:44:50] (03PS7) 10Filippo Giunchedi: WIP swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [10:44:52] (03PS5) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [10:46:16] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [10:46:33] (03CR) 10Jbond: "Bug: T233203" [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) 
(owner: 10Jbond) [10:54:06] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10WMDE-leszek) @Dzahn the instance was no longer in use, so I've just deleted it. [10:54:19] 10Puppet, 10Patch-For-Review: Analyses octocatalog-diff output - https://phabricator.wikimedia.org/T233203 (10jbond) [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1100). [11:00:05] awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:26] jouncebot: Thank you for the introduction. I'll deploy my own bugs :-) [11:01:16] 10Operations, 10Traffic: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10jijiki) [11:01:52] 10Operations, 10Traffic: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10jijiki) [11:01:55] 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10jijiki) [11:02:17] It used to be a tshirt, I think it's due to 2008 recession [11:02:53] (03PS2) 10Awight: NowCommons test & test2wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) [11:03:13] (03PS8) 10Filippo Giunchedi: swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [11:03:15] (03PS6) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [11:03:50] (03CR) 10Awight: [C: 03+2] NowCommons test & test2wiki 
configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:04:19] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [11:04:40] (03Merged) 10jenkins-bot: NowCommons test & test2wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:05:20] (03CR) 10jerkins-bot: [V: 04-1] swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [11:06:23] (03PS1) 10Vgutierrez: ATS: Ensure that the origin timeout is also applied to parent servers [puppet] - 10https://gerrit.wikimedia.org/r/537625 (https://phabricator.wikimedia.org/T233205) [11:06:48] (03CR) 10jenkins-bot: NowCommons test & test2wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:08:36] (03PS9) 10Filippo Giunchedi: swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) [11:08:38] (03PS7) 10Filippo Giunchedi: WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 [11:09:34] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18370/" [puppet] - 10https://gerrit.wikimedia.org/r/537625 (https://phabricator.wikimedia.org/T233205) (owner: 10Vgutierrez) [11:09:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: turn on swiftrepl on swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/537613 (owner: 10Filippo Giunchedi) [11:09:53] (03PS2) 10Vgutierrez: ATS: Ensure that the origin timeout is also applied to parent servers [puppet] - 10https://gerrit.wikimedia.org/r/537625 (https://phabricator.wikimedia.org/T233205) [11:10:18] (03CR) 
10Filippo Giunchedi: "Good to review" [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [11:10:49] (03PS3) 10Alexandros Kosiaris: Add discovery RRs for wikifeeds [dns] - 10https://gerrit.wikimedia.org/r/535523 (https://phabricator.wikimedia.org/T170455) [11:10:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add discovery RRs for wikifeeds [dns] - 10https://gerrit.wikimedia.org/r/535523 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [11:14:55] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:537375|NowCommons test & test2wiki configuration (T228851)]] (duration: 01m 15s) [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:09] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:15:22] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) >>! In T229209#5502350, @jcrespo wrote: > > @akosiaris I would like to disable the accidental reimage of these servers (we suffered from these on a board change... 
[11:16:46] (03PS2) 10Awight: Enable FileImport source wiki editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) [11:18:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+1] sessionstore: configure cassandra for `local_dc` [deployment-charts] - 10https://gerrit.wikimedia.org/r/537552 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [11:18:52] (03Merged) 10jenkins-bot: Enable FileImport source wiki editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:19:29] (03CR) 10jenkins-bot: Enable FileImport source wiki editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:23:31] (03PS1) 10Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) [11:23:56] (03PS1) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [11:24:45] (03CR) 10jerkins-bot: [V: 04-1] swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:25:52] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:537586|Enable FileImport source wiki editing (T228851)]] (duration: 01m 03s) [11:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:03] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:27:02] !log awight@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:537586|Enable FileImport source wiki editing (T228851)]] (duration: 00m 59s) [11:27:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:13] !log EU SWAT complete [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:59] (03PS2) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [11:29:57] !log bootstrap restbase2011-c -- T224553 [11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:00] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [11:30:10] (03CR) 10jerkins-bot: [V: 04-1] swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:31:23] (03PS5) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [11:32:37] (03CR) 104nn1l2: "> Patch Set 4: Code-Review-1" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 104nn1l2) [11:41:23] (03CR) 10Jbond: "PCC shows many files which now have the backup attribute removed. however i suspect theses should have never had the attribute to begin w" [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:41:40] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:42:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) [11:50:13] (03PS3) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [11:50:52] (03CR) 10jerkins-bot: [V: 04-1] swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:51:08] (03PS4) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [11:51:24] (03PS5) 10Jbond: swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [11:52:09] (03CR) 10jerkins-bot: [V: 04-1] swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:55:34] (03CR) 10Jbond: "The PCC diff shows a few additional files which are effected however i suspect theses files shouldn't have had the properties to begin wit" [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:56:13] (03CR) 10Jbond: "The PCC diff shows a few additional files which are effected however i suspect theses files shouldn't have had the properties to begin wit" [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [11:56:42] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) 
[11:58:35] (03PS1) 10Vgutierrez: ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be [puppet] - 10https://gerrit.wikimedia.org/r/537630 (https://phabricator.wikimedia.org/T233205) [11:59:24] (03PS6) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [12:02:27] (03CR) 10Vgutierrez: [C: 03+2] ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be [puppet] - 10https://gerrit.wikimedia.org/r/537630 (https://phabricator.wikimedia.org/T233205) (owner: 10Vgutierrez) [12:02:35] (03PS2) 10Vgutierrez: ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be [puppet] - 10https://gerrit.wikimedia.org/r/537630 (https://phabricator.wikimedia.org/T233205) [12:02:46] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Gehel) Tried again, seems that sda has issues (see log below). Is it that the second disk also failed? Or that the wrong disk was replaced? Or something else? @Cmjohnson... 
[12:03:49] !log Stop haproxy on dbproxy1006 - T233207 [12:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:52] T233207: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 [12:05:59] (03CR) 10Phamhi: [C: 03+2] openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [12:06:01] (03PS6) 10Gehel: wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:08:28] (03CR) 10Gehel: [C: 03+2] wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:08:47] (03PS4) 10Gehel: wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:09:15] (03PS7) 10Gehel: wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:11:11] (03CR) 10Gehel: [C: 03+2] wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:12:16] (03PS5) 10Gehel: wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:12:18] (03CR) 10Jbond: "https://puppet-compiler.wmflabs.org/compiler1001/18380/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [12:13:10] (03PS7) 10Jbond: swift: use per-resource defaiult attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 
(https://phabricator.wikimedia.org/T233203) [12:14:17] (03PS8) 10Jbond: swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [12:15:34] (03PS6) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [12:18:24] 10Operations, 10Traffic: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10Vgutierrez) It looks like ats-tls setting `Proxy-Connection`to `Close` is messing with varnish-fe<-->ats-be connections as it can be seen in https://grafana.wikimedia.org/d/000000352... [12:18:35] !log restarting ats-tls to avoid spreading Proxy-Connection header - T233205 [12:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:38] T233205: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 [12:20:16] PROBLEM - Host an-coord1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:19] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10elukey) @MGerlach done! Please take a moment to review https://wikitech.wikimedia.org/wiki/LDAP/Groups#wmf_group, since your account is now able to see a... [12:21:22] RECOVERY - Host an-coord1001 is UP: PING OK - Packet loss = 0%, RTA = 10.93 ms [12:21:40] this is hw maintenance --^ [12:21:44] we are working on it in the DC [12:22:56] 10Puppet, 10Patch-For-Review: Analyses octocatalog-diff output - https://phabricator.wikimedia.org/T233203 (10jbond) nodes with changes |Host|Changes|Type of Change|remediation| | ---- | --------- | ----------------- | ------------- | |bast3002.wikimedia.org|4|Resource Defaults|https://gerrit.wikimedia.org/r/... 
[12:23:26] PROBLEM - Host an-coord1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:25:18] Good morning/afternoon/night operations! Just wanted to wish you all luck on all your deployments and tasks today! :) [12:26:29] 10Operations, 10Traffic: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Normal a:03Vgutierrez Solved by preventing Proxy-Connection from spreading across varnish-fe and ats-be, thanks for reporting the... [12:26:32] 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [12:33:12] (03PS1) 10Mathew.onipe: wdqs: switch production clusters to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537634 (https://phabricator.wikimedia.org/T232184) [12:35:43] (03PS1) 10Muehlenhoff: Add access for the Icinga replication check [puppet] - 10https://gerrit.wikimedia.org/r/537635 [12:38:04] (03CR) 10Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/18381/" [puppet] - 10https://gerrit.wikimedia.org/r/537634 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [12:38:06] RECOVERY - Host an-coord1001 is UP: PING WARNING - Packet loss = 54%, RTA = 0.25 ms [12:40:17] !log Deploy schema change on s6 codfw master with replication T231172 [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:21] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [12:43:49] (03PS2) 10Mathew.onipe: wdqs: switch production clusters to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537634 (https://phabricator.wikimedia.org/T232184) [12:46:47] (03PS1) 10DCausse: [cirrus] glent method 0 A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537637 (https://phabricator.wikimedia.org/T233211) [12:49:15] (03CR) 10Thiemo Kreuz (WMDE): Enable 
FileImport source wiki editing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [12:51:07] (03CR) 10Awight: [C: 03+2] Enable FileImport source wiki editing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [12:52:55] !log gracefully stopping Zuul (kill SIGUSR1) to prepare for Jenkins restart [12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:12] !log Deploy schema change on the following s6 hosts: db1088, db1093, db1096, db1098, db1139, dbstore1005 - T231172 [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:16] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172 [12:59:58] Hi ops team - I'm experiencing an error trying to log into wikitech - Is that a known issue? [13:00:12] error: [fb6658b5ae6822f4b2fb70fc] /w/index.php?title=Special:UserLogin&returnto=User:Elukey/Analytics/Hadoop+testing+cluster UnderflowException from line 446 of /srv/mediawiki/w/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php: Ran out of captcha images [13:00:36] nice [13:00:46] Some captcha related issue, I guess [13:01:36] https://wikitech.wikimedia.org/wiki/Generating_CAPTCHAs ;) [13:01:40] Probably needs more images generated [13:01:49] Oh hashar beat me to it :D [13:02:14] so yeah 1) file a task [13:02:20] do we have to go to gimp and start creating more manually? [13:02:20] 2) generate moare captchas [13:02:41] Reedy might know more [13:02:55] not complaining, just find it amusing [13:03:09] supposedly there is a cron job to generate them [13:03:16] O_O [13:03:18] I thought there was a script in the maintenance dir for generating captchas? 
[13:03:25] modules/mediawiki/manifests/maintenance/generatecaptcha.pp [13:03:40] with a cron acting on first day of the week at 1:00 [13:03:55] And login actually succeeded once moving from the error page [13:03:55] hashar: not it says underflow exception [13:03:57] and hmm apparently generating 10k of them, but that might not be enough ? :-\ [13:03:59] *note [13:04:09] it could be a different bug? [13:04:43] well "Ran out of captcha images" sounds pretty clear to me ? [13:04:49] hashar: i'm pretty sure there is a task about that cron being broken [13:05:01] at least in production or something [13:09:39] https://phabricator.wikimedia.org/T230245 [13:10:30] (03PS1) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [13:11:28] !log Restarting Jenkins, starting Zuul [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] that worked [13:12:43] hashar: speaking of zuul, when you get a chance I'd appreciate your thoughts on https://gerrit.wikimedia.org/r/c/operations/puppet/+/537362 [13:12:51] joal: so essentially we need a task to be filed against #confirmedit and referring to the broken cron p858snake|L mentioned: T230245 [13:12:52] T230245: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 [13:13:07] joal: and I guess someone would have to dig in the cron log to see whatever might be failing [13:13:20] godog: for statsd / prometheus isn't it? It is in my review queue [13:13:38] hashar: awesome, thanks! yeah that's for statsd/prometheus [13:13:41] godog: yeah. 
So will hopefully come back to it with some questions tomorrow :] [13:13:52] hashar: Will create a task, thank you - Have a good deloy [13:13:59] +p [13:15:09] (03PS1) 10Mathew.onipe: wdqs: cleanup logging config after switching to new pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537642 (https://phabricator.wikimedia.org/T232184) [13:15:40] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:18:22] (03CR) 10Andrew Bogott: [C: 03+1] Restrict NTP servers to production networks (including frack and network gear) [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [13:22:15] 10Operations: Problem with captcha at login on Wikitech - https://phabricator.wikimedia.org/T233215 (10JAllemandou) [13:22:21] hashar: --^ [13:22:34] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/18387/" [puppet] - 10https://gerrit.wikimedia.org/r/537642 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [13:23:50] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:13] (03PS2) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [13:25:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy: sync base container with conventions used in production [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/537466 (owner: 10Giuseppe Lavagetto) [13:26:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:38] 10Operations, 10ConfirmEdit (CAPTCHA extension): Problem with captcha at login on Wikitech - https://phabricator.wikimedia.org/T233215 (10hashar) [13:27:03] (03CR) 
10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:27:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "maybe a README about the variables supported as well?" (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/399640 (owner: 10Giuseppe Lavagetto) [13:27:16] 10Operations, 10ConfirmEdit (CAPTCHA extension): Problem with captcha at login on Wikitech - https://phabricator.wikimedia.org/T233215 (10hashar) @Reedy might know more about the magic of the Fancy captchas. Maybe that is just the cron which is broken T230245. [13:30:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 0.2.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/537619 (owner: 10Giuseppe Lavagetto) [13:31:16] (03Merged) 10jenkins-bot: Release 0.2.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/537619 (owner: 10Giuseppe Lavagetto) [13:36:10] !log joal@deploy1001 Started deploy [analytics/refinery@ca30c4e]: Regular analytics weekly train [13:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:15] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:06] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537635 (owner: 10Muehlenhoff) [13:40:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:06] (03PS3) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [13:41:37] !log joal@deploy1001 Finished deploy [analytics/refinery@ca30c4e]: Regular analytics weekly train (duration: 05m 28s) [13:41:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:44] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10MGerlach) 05Open→03Resolved @elukey thanks, works now. Closing this task. [13:41:46] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:43:43] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/265/graphite1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [13:46:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Thanks to the awesome work of @Jclark-ctr an-presto1001 and an-presto1003 are now reimaged, but an-p... [13:48:00] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Proposed fix for asw2-b: ` delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5 s... [13:50:20] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [13:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:48] (03PS1) 10Ottomata: Remove admin group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) [13:52:27] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:50] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10akosiaris) >>! In T225128#5503053, @elukey wrote: > Proposed fix for asw2-b: > > ` > delete interfaces inte... [13:56:00] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/18389/kafka-main1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [13:56:35] (03PS1) 10Filippo Giunchedi: facilities: support sentry4 for 3 phase monitoring [puppet] - 10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) [13:56:43] (03CR) 10Ottomata: "@ppchelko, do you need direct access to kafka main brokers? I expect it would be useful. Perhaps I should add the cpjobqueue-admin group " [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:02:57] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Committed: ` elukey@asw2-b-eqiad# show | compare [edit interfaces interface-range vlan-cloud-hosts1... 
[14:04:56] (03PS1) 10Ottomata: Rename eventbus cluster to kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) [14:08:13] (03PS6) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [14:09:02] (03PS1) 10Ottomata: Remove unused eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/537649 (https://phabricator.wikimedia.org/T232122) [14:11:44] (03CR) 10Elukey: [C: 03+1] "I am not aware of any special procedure to remove a group, should be ok!" [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:12:33] (03CR) 10Elukey: "Change looks good to me, I am wondering if monitoring-wise this changes some settings? In theory no but better triple check with Filippo i" [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:14:11] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/18390/kafka-main1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537649 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:15:59] (03CR) 10Ottomata: [C: 03+2] "Should be actual no-op, service is already removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/537649 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:16:07] (03PS2) 10Ottomata: Remove unused eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/537649 (https://phabricator.wikimedia.org/T232122) [14:16:13] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove unused eventbus puppetization [puppet] - 10https://gerrit.wikimedia.org/r/537649 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:16:13] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:14] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] (03PS2) 10Ottomata: Rename eventbus cluster to kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) [14:23:18] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10herron) Aliases have been added, dns records are live, and test mail to myself works. Please see https://phabricator.wikimedia.org/T231387#5488828 for currently open follow-up questions. [14:24:50] (03CR) 10Muehlenhoff: Remove admin group eventbus-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:29:11] (03CR) 10Ottomata: "I removed some cluster templating on the (now renamed) change-propagation grafana dashboard, but aside from that I think it will be fine. " [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:29:33] (03CR) 10Ottomata: "No op on kafka-main1001 anyway. 
https://puppet-compiler.wmflabs.org/compiler1001/18391/" [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:29:49] (03CR) 10Ottomata: [C: 03+2] Rename eventbus cluster to kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537648 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [14:35:24] (03PS1) 10Elukey: Remove cloudvirtan100X references [dns] - 10https://gerrit.wikimedia.org/r/537651 (https://phabricator.wikimedia.org/T225128) [14:35:46] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) Going forward with Plan #1 (which I also find better) [14:36:08] (03PS2) 10Alexandros Kosiaris: LVS: Setup port 7233 for restbase-backend [puppet] - 10https://gerrit.wikimedia.org/r/534430 (https://phabricator.wikimedia.org/T223953) [14:41:41] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Ok so current status: * All hosts reimaged to buster and working * Renamed hostnames in netbox *... [14:42:39] (03PS2) 10Ottomata: Remove admin group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) [14:43:54] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) I followed up with @Jclark-ctr and it seems that there is only one serial port available for the server (so we can't really change any cabling). @Cm... 
[14:44:15] (03PS1) 10Herron: admin: remove marble and mmarble from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) [14:45:48] (03PS1) 10Ottomata: Fix description of kafka_main in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/537655 [14:46:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) AWESOME thank youuuu [14:46:47] (03CR) 10Ottomata: [C: 03+2] Fix description of kafka_main in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/537655 (owner: 10Ottomata) [14:50:18] (03CR) 10CRusnov: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [14:51:54] (03PS1) 10Ottomata: Rename 'eventbus' grafana alerts and notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/537657 (https://phabricator.wikimedia.org/T232122) [14:53:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18393/ says ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/534430 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:54:10] (03PS3) 10Alexandros Kosiaris: LVS: Setup port 7233 for restbase-backend [puppet] - 10https://gerrit.wikimedia.org/r/534430 (https://phabricator.wikimedia.org/T223953) [14:54:31] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] LVS: Setup port 7233 for restbase-backend [puppet] - 10https://gerrit.wikimedia.org/r/534430 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [14:56:02] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/18 descriptions Interface Admin Link Description ge-0/0/18 up up 
frqueue2001:eth0 ge-1/0/18 up... [14:56:33] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) [14:59:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: service=restbase-backend [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:05] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: Problem with captcha at login on Wikitech: UnderflowException: Ran out of captcha images - https://phabricator.wikimedia.org/T233215 (10Aklapper) [15:05:28] (03PS1) 10Dzahn: allocate mw1298 as a jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) [15:06:26] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:52] (03PS2) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [15:08:08] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10herron) Hi @raja_wmde could you please coordinate obtaining a comment of manager approval on this task? Thanks! 
[15:08:09] (03CR) 10CRusnov: "> Patch Set 1:" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [15:08:50] (03CR) 10Ottomata: [C: 03+2] Rename 'eventbus' grafana alerts and notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/537657 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:08:57] (03PS2) 10Ottomata: Rename 'eventbus' grafana alerts and notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/537657 (https://phabricator.wikimedia.org/T232122) [15:09:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Rename 'eventbus' grafana alerts and notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/537657 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:10:24] (03CR) 10CRusnov: Add script to rotate backup dumps, and dump with timestamp (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [15:10:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:11:00] (03CR) 10Filippo Giunchedi: "TTBOMK this should be correct to monitor phases on sentry4 PDUs, although please double check!" [puppet] - 10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [15:11:05] (03CR) 10CRusnov: "What is the status of this?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/519244 (owner: 10CRusnov) [15:11:21] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:11:35] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/18394/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [15:11:38] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10AAlikhan) Thank you! I tested and it's working. [15:12:02] (03CR) 10jerkins-bot: [V: 04-1] netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [15:12:22] (03PS2) 10CRusnov: decommission: Add Netbox state change [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 [15:12:57] (03CR) 10CRusnov: decommission: Add Netbox state change (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/519244 (owner: 10CRusnov) [15:13:03] 10Operations, 10DC-Ops, 10observability, 10Patch-For-Review: Phase monitoring for new PDUs - https://phabricator.wikimedia.org/T229101 (10fgiunchedi) @RobH related to {T148541}, I took a stab at adjusting the OIDs for phase monitoring on 3 phase sentry4 PDUs in https://gerrit.wikimedia.org/r/537646 please... [15:13:09] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:14:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, make sure to run offboard-user on both UIDs" [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) (owner: 10Herron) [15:14:43] (03CR) 10RobH: [C: 03+1] facilities: support sentry4 for 3 phase monitoring [puppet] - 10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [15:14:51] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [15:14:53] (03CR) 10SBassett: [C: 03+1] "Security Team approves." [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) (owner: 10Herron) [15:15:51] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 431050 seconds left:Certificate *.wikimania.com valid until 2019-11-27 14:41:45 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Ncredir [15:16:15] (03PS3) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [15:17:41] (03CR) 10CRusnov: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [15:17:53] (03PS2) 10Herron: admin: remove marble and mmarble from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) [15:18:23] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:18:46] (03CR) 
10Muehlenhoff: [C: 04-1] "You also need to update absent_ldap, sorry missed that initially." [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) (owner: 10Herron) [15:20:59] (03CR) 10Filippo Giunchedi: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:21:48] (03PS1) 10Ottomata: Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) [15:23:27] (03PS3) 10Herron: admin: remove marble and mmarble from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) [15:24:03] (03CR) 10jerkins-bot: [V: 04-1] Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:24:23] (03CR) 10Herron: "> You also need to update absent_ldap, sorry missed that initially." 
[puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) (owner: 10Herron) [15:24:53] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:25:18] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/18395/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:26:22] (03PS2) 10Ottomata: Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) [15:28:32] (03PS1) 10Ottomata: Fix proper cluster assignment for kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537669 (https://phabricator.wikimedia.org/T232122) [15:28:57] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix proper cluster assignment for kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:29:46] (03PS14) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [15:29:59] (03CR) 10CRusnov: backends: add Netbox backend (036 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [15:30:01] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: support sentry4 for 3 phase monitoring [puppet] - 10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [15:30:16] (03PS3) 10Ottomata: Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) [15:30:18] (03PS2) 10Filippo Giunchedi: facilities: support sentry4 for 3 phase monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/537646 (https://phabricator.wikimedia.org/T229101) [15:30:25] (03PS7) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [15:30:38] jouncebot: now [15:30:38] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [15:30:39] jouncebot: next [15:30:40] In 0 hour(s) and 29 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1600) [15:31:17] (03CR) 10Jbond: "update thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:31:21] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:31:29] (03PS4) 10Ottomata: Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) [15:31:31] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Rename some analytics related eventbus jobs [puppet] - 10https://gerrit.wikimedia.org/r/537664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [15:31:39] (03CR) 10Jbond: [C: 03+2] swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:31:49] (03PS9) 10Jbond: swift: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537629 (https://phabricator.wikimedia.org/T233203) [15:32:24] I like to think we're just body-checking one another to get puppet changes in [15:32:43] (03CR) 10jerkins-bot: [V: 04-1] prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 
(https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:34:36] 10Operations, 10serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (10thcipriani) >>! In T230024#5502211, @jijiki wrote: > @thcipriani is it ok if we update php7 to 7.2.22 on deploy* servers? Do you know if there are any dependencies ? The only thing scap does on the depl... [15:35:41] (03CR) 10Jbond: [C: 03+2] graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:35:53] (03CR) 10jerkins-bot: [V: 04-1] backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [15:37:50] (03PS8) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [15:37:52] RECOVERY - ps1-b5-eqiad-infeed-load-tower-A-phase-Y on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-A-phase-Y 205 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:02] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-Y on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-Y 380 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:02] expected ^ [15:38:06] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-Y on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-Y 258 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:06] RECOVERY - ps1-b5-eqiad-infeed-load-tower-B-phase-Y on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-B-phase-Y 195 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:20] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-Z on ps1-a3-eqiad is 
OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-Z 426 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:36] RECOVERY - ps1-a3-eqiad-infeed-load-tower-B-phase-X on ps1-a3-eqiad is OK: SNMP OK - ps1-a3-eqiad-infeed-load-tower-B-phase-X 384 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:38] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-Y on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-Y 241 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:39:32] (03PS2) 10Dzahn: cumin: remove labpuppetmaster alias [puppet] - 10https://gerrit.wikimedia.org/r/537536 [15:39:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Nuria) @Urbanecm I have corrected the information about data access, sorry about that. From what i can tell the query that lead to you requesting access can... 
[15:39:58] (03CR) 10jerkins-bot: [V: 04-1] prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:41:16] RECOVERY - ps1-b6-eqiad-infeed-load-tower-A-phase-Y on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-A-phase-Y 470 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:41:59] (03CR) 10Dzahn: [C: 03+2] cumin: remove labpuppetmaster alias [puppet] - 10https://gerrit.wikimedia.org/r/537536 (owner: 10Dzahn) [15:42:09] (03PS3) 10Dzahn: cumin: remove labpuppetmaster alias [puppet] - 10https://gerrit.wikimedia.org/r/537536 [15:42:50] (03CR) 10Herron: "> -2 because while the patch is technically correct, I think this is" [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [15:43:10] RECOVERY - ps1-b6-eqiad-infeed-load-tower-A-phase-Z on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-A-phase-Z 376 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:59] 10Operations, 10Discovery, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10Jhernandez) [15:44:40] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:44:55] 10Operations, 10Discovery, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10Jhernandez) p:05Triage→03High a:03MSantos Hey @MSantos, this seems fixed AFAIK, resolve the task if it is. Thanks! 
[15:45:06] RECOVERY - ps1-b6-eqiad-infeed-load-tower-B-phase-X on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-B-phase-X 576 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:22] RECOVERY - ps1-b5-eqiad-infeed-load-tower-A-phase-Z on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-A-phase-Z 258 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:32] (03CR) 10Ejegg: [C: 03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) (owner: 10AndyRussG) [15:47:26] (03PS4) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [15:48:22] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:48:56] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-Z on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-Z 404 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:49:42] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:50:52] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-X on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-X 310 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:06] (03PS5) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [15:51:08] RECOVERY - 
ps1-b6-eqiad-infeed-load-tower-A-phase-X on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-A-phase-X 554 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:52:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Urbanecm) 05Open→03Declined >>! In T231616#5503379, @Nuria wrote: > @Urbanecm I have corrected the information about data access, sorry about that. From... [15:52:48] RECOVERY - ps1-a4-eqiad-infeed-load-tower-B-phase-Z on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-B-phase-Z 412 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:04] RECOVERY - ps1-a4-eqiad-infeed-load-tower-A-phase-X on ps1-a4-eqiad is OK: SNMP OK - ps1-a4-eqiad-infeed-load-tower-A-phase-X 291 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:04] RECOVERY - ps1-b6-eqiad-infeed-load-tower-B-phase-Y on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-B-phase-Y 416 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:08] !log decommissioning Cassandra, restbase2012-a -- T224553 [15:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:14] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [15:53:57] (03Abandoned) 10Herron: admin: add urbanecm to researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/537524 (https://phabricator.wikimedia.org/T231616) (owner: 10Herron) [15:54:42] RECOVERY - ps1-b5-eqiad-infeed-load-tower-A-phase-X on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-A-phase-X 241 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:54:57] (03CR) 10Jbond: [C: 03+2] graphite: use per-resource default attributes not resource defaults 
[puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [15:55:02] RECOVERY - ps1-b6-eqiad-infeed-load-tower-B-phase-Z on ps1-b6-eqiad is OK: SNMP OK - ps1-b6-eqiad-infeed-load-tower-B-phase-Z 413 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:05] (03PS6) 10Jbond: graphite: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537641 (https://phabricator.wikimedia.org/T233203) [15:55:43] !log repooling restbase2011 after reimage/bootstrap [15:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:01] (03PS9) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [15:56:36] RECOVERY - ps1-b5-eqiad-infeed-load-tower-B-phase-X on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-B-phase-X 269 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:43] (03PS10) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [15:58:17] (03PS3) 10Ottomata: Remove eventbus-admins group users, use cp admins on kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) [15:58:36] RECOVERY - ps1-b5-eqiad-infeed-load-tower-B-phase-Z on ps1-b5-eqiad is OK: SNMP OK - ps1-b5-eqiad-infeed-load-tower-B-phase-Z 229 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:59:20] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) >>! 
In T233146#5502352, @ItSpiderman wrote: > Hm, it cant be an array, key can only be generated from > > `TOTPKey::newFromRandom()` > > which cannot return an array. Thats wei... [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:17] (03CR) 10Ottomata: [C: 03+2] Remove eventbus-admins group users, use cp admins on kafka_main [puppet] - 10https://gerrit.wikimedia.org/r/537645 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:00:37] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10LucasWerkmeister) Firefox, probably 70, I can check the exact version when I’m home. [16:02:05] * Urbanecm takes the empty window [16:02:43] (03CR) 10Urbanecm: [C: 03+2] Add suppressredirect right to filemovers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537534 (https://phabricator.wikimedia.org/T233137) (owner: 10Zoranzoki21) [16:03:15] Amir1: when do you plan to try the wiki thing? 
[16:03:33] in one hour [16:03:38] deployment window [16:03:39] (03Merged) 10jenkins-bot: Add suppressredirect right to filemovers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537534 (https://phabricator.wikimedia.org/T233137) (owner: 10Zoranzoki21) [16:03:55] Amir1: yes, empty deployment window (I'm currently deploying) :) [16:03:58] (03CR) 10jenkins-bot: Add suppressredirect right to filemovers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537534 (https://phabricator.wikimedia.org/T233137) (owner: 10Zoranzoki21) [16:05:45] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: ba30276: Add suppressredirect right to filemovers on bnwiki (T233137) (duration: 01m 05s) [16:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:48] T233137: Update $wgGroupPermissions for bnwiki - https://phabricator.wikimedia.org/T233137 [16:08:46] 10Operations, 10Discovery, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Wikimedia maps instability (maps.wikimedia.org) - https://phabricator.wikimedia.org/T232817 (10MSantos) 05Open→03Resolved FYI the incident documentation is https://wikitech.wikimedia.org/wiki/Incident_documentation/20190... 
[16:09:12] (03PS1) 10Ottomata: Remove absent job eventlogging_eventbus_job_queue [puppet] - 10https://gerrit.wikimedia.org/r/537679 (https://phabricator.wikimedia.org/T232122) [16:12:28] 10Operations, 10Maps (Kartotherian), 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban), 10Wikimedia-Incident: Create test in spec.yaml for the kartotherian / geoshape service - https://phabricator.wikimedia.org/T217910 (10MSantos) 05Open→03Resolved [16:12:52] (03CR) 10Ottomata: [C: 03+2] Remove absent job eventlogging_eventbus_job_queue [puppet] - 10https://gerrit.wikimedia.org/r/537679 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:13:06] (03PS2) 10Urbanecm: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:13:21] (03PS3) 10Urbanecm: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:13:25] (03CR) 10Urbanecm: [C: 03+2] Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:14:03] (03PS2) 10Urbanecm: Turn on EventLogging at 100% for DonateWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) (owner: 10AndyRussG) [16:14:17] (03CR) 10Urbanecm: [C: 03+2] Turn on EventLogging at 100% for DonateWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) (owner: 10AndyRussG) [16:14:19] (03CR) 10jerkins-bot: [V: 04-1] Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:14:21] (03CR) 10jerkins-bot: [V: 04-1] Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 
(https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:15:07] (03Merged) 10jenkins-bot: Turn on EventLogging at 100% for DonateWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) (owner: 10AndyRussG) [16:15:42] (03PS1) 10Ottomata: Rename camus job eventbus to mediawiki_events [puppet] - 10https://gerrit.wikimedia.org/r/537681 (https://phabricator.wikimedia.org/T232122) [16:16:26] (03CR) 10jenkins-bot: Turn on EventLogging at 100% for DonateWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) (owner: 10AndyRussG) [16:16:31] (03PS4) 10Urbanecm: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:17:07] (03PS5) 10Urbanecm: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:17:15] (03CR) 10Urbanecm: [C: 03+2] Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:18:04] (03CR) 10Elukey: Adding config for friendly values on netflow dataset (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) (owner: 10Nuria) [16:18:06] (03Merged) 10jenkins-bot: Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:18:08] (03CR) 10Ottomata: [C: 03+2] Rename camus job eventbus to mediawiki_events [puppet] - 10https://gerrit.wikimedia.org/r/537681 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:18:16] 10Operations: Add Urbanecm to #mediawiki_security - https://phabricator.wikimedia.org/T233235 (10Urbanecm) [16:18:22] (03CR) 10jenkins-bot: 
Change Telugu Wikisource Logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535259 (https://phabricator.wikimedia.org/T232065) (owner: 10Jayprakash12345) [16:18:38] 10Operations, 10Cloud-Services: clarification of cloud terms of use regarding LDAP servers - https://phabricator.wikimedia.org/T233158 (10bd808) > it has been said that cloud VPSes can use these ro-replicas for testing. For example to get a Gerrit running in cloud VPS to be able to test changes before applying... [16:18:45] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 817d679: Turn on EventLogging at 100% for DonateWiki (T233145) (duration: 01m 04s) [16:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:48] T233145: Turn on EventLogging from Donate Wiki at 100% - https://phabricator.wikimedia.org/T233145 [16:19:08] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:26] (03PS3) 10AndyRussG: Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) [16:22:00] (03PS1) 10Andrew Bogott: Revert "cloudweb2001-dev: remove wikitech profiles" [puppet] - 10https://gerrit.wikimedia.org/r/537683 [16:22:30] (03CR) 10Urbanecm: [C: 03+2] Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) (owner: 10AndyRussG) [16:22:50] (03PS1) 10Ottomata: Use $ensure param for camus job properties file [puppet] - 10https://gerrit.wikimedia.org/r/537684 (https://phabricator.wikimedia.org/T232122) [16:23:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use $ensure param for camus job properties file [puppet] - 10https://gerrit.wikimedia.org/r/537684 (https://phabricator.wikimedia.org/T232122) 
(owner: 10Ottomata) [16:23:22] (03Merged) 10jenkins-bot: Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) (owner: 10AndyRussG) [16:23:42] (03CR) 10jenkins-bot: Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) (owner: 10AndyRussG) [16:24:03] (03PS2) 10Urbanecm: Add Draft and Draft_talk aliases for wikis that define draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [16:24:09] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 7c987fc: Change Telugu Wikisource Logo (T232065; 1/2) (duration: 01m 05s) [16:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:13] T232065: Update logo for Telugu Wikisource - https://phabricator.wikimedia.org/T232065 [16:25:16] (03PS3) 10Urbanecm: Add Draft and Draft_talk aliases for wikis that define draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [16:25:31] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 7c987fc: Change Telugu Wikisource Logo (T232065; 2/2) (duration: 01m 06s) [16:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:19] !log Purge https://en.wikipedia.org/static/images/project-logos/tewikisource.png (T232065) [16:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:38] (03CR) 10Urbanecm: [C: 03+2] Add Draft and Draft_talk aliases for wikis that define draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [16:26:47] (03PS1) 10Ottomata: Don't render 
camus::job properties file if not ensure => present [puppet] - 10https://gerrit.wikimedia.org/r/537687 (https://phabricator.wikimedia.org/T232122) [16:27:23] James_F: I see "Function already defined: wmfGetVariantSettings in /srv/mediawiki/wmf-config/InitialiseSettings.php" in fatalmonitor. Is that expected? [16:27:27] (03Merged) 10jenkins-bot: Add Draft and Draft_talk aliases for wikis that define draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [16:27:42] (03CR) 10jenkins-bot: Add Draft and Draft_talk aliases for wikis that define draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [16:28:10] Urbanecm: Yeah, sadly. [16:28:34] Urbanecm: It shouldn't happen often. Fix is on its way. [16:28:53] cool, thanks James_F [16:29:14] (03PS2) 10Ottomata: Don't render camus::job properties file if not ensure => present [puppet] - 10https://gerrit.wikimedia.org/r/537687 (https://phabricator.wikimedia.org/T232122) [16:30:24] (03CR) 10Ottomata: [C: 03+2] Don't render camus::job properties file if not ensure => present [puppet] - 10https://gerrit.wikimedia.org/r/537687 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:31:09] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 6e59651: Disable FundraiserLandingPage extension on test.wikipedia.org (T203020) (duration: 01m 04s) [16:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:12] T203020: Turn off LandingPage events from testwiki - https://phabricator.wikimedia.org/T203020 [16:32:14] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: configure cassandra for `local_dc` [deployment-charts] - 10https://gerrit.wikimedia.org/r/537552 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [16:33:32] (03PS1) 10Ottomata: Remove absent job camus eventbus [puppet] 
- 10https://gerrit.wikimedia.org/r/537688 (https://phabricator.wikimedia.org/T232122) [16:35:15] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: dc1298d: Add Draft and Draft_talk aliases for wikis that define draft namespace (T223472) (duration: 01m 02s) [16:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] T223472: Draft namespace not showing as publish option for some wikis - https://phabricator.wikimedia.org/T223472 [16:36:01] (03CR) 10Ottomata: [C: 03+2] Remove absent job camus eventbus [puppet] - 10https://gerrit.wikimedia.org/r/537688 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [16:36:48] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:16] (03PS4) 10Urbanecm: Enable logging for BlockManager channel at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) [16:39:31] (03PS5) 10Urbanecm: Enable logging for BlockManager channel at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) [16:39:36] (03CR) 10Urbanecm: [C: 03+2] Enable logging for BlockManager channel at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:40:31] (03Merged) 10jenkins-bot: Enable logging for BlockManager channel at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:41:53] (03CR) 10Ottomata: [C: 03+2] Release debian version 0.8.0+core0.6.9~1-1 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/537123 (owner: 10Ottomata) [16:42:28] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 8340be9: Enable logging for BlockManager 
channel at info level (test) (duration: 01m 04s) [16:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:55] (03CR) 10jenkins-bot: Enable logging for BlockManager channel at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:42:58] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [16:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:45] !log 8340be9 sync is for T230822, mistakenly inserted `test` instead of the task number [16:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:48] T230822: Include `dnsblacklist` in logs - https://phabricator.wikimedia.org/T230822 [16:45:40] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:02] (03PS1) 10Urbanecm: Enable DNS blacklist on testwiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537691 [16:46:57] (03PS2) 10Urbanecm: Enable DNS blacklist on testwiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537691 (https://phabricator.wikimedia.org/T230822) [16:47:11] (03CR) 10Urbanecm: [C: 03+2] Enable DNS blacklist on testwiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537691 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:48:00] (03Merged) 10jenkins-bot: Enable DNS blacklist on testwiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537691 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:48:39] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: [[:gerrit:537691|Enable DNS blacklist on testwiki temporarily]] (T230822) (duration: 01m 03s) [16:48:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:37] (03PS1) 10Papaul: DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 [16:49:39] (03PS1) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [16:49:41] (03CR) 10jenkins-bot: Enable DNS blacklist on testwiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537691 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [16:50:01] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 (owner: 10Papaul) [16:54:34] (03PS2) 10Papaul: DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 [16:55:16] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 (owner: 10Papaul) [16:59:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 57.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:59:38] (03PS3) 10Papaul: DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 [17:00:04] Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Creating hiwikisource. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1700). [17:00:32] let's start [17:00:32] !log Morning SWAT done [17:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:37] * Urbanecm waves to Amir1 [17:00:44] * Amir1 waves back [17:01:13] for now, let's merge the backports [17:01:29] technically you don't need to sync it, just scap pull on mwmaint1002 [17:01:52] you mean the T212881 backports? 
[17:01:53] T212881: addWiki.php broken creating ES tables - https://phabricator.wikimedia.org/T212881 [17:02:00] (03PS1) 10Filippo Giunchedi: facilities: set model for ps1-b3-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/537696 (https://phabricator.wikimedia.org/T233129) [17:02:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10fgiunchedi) I misspoke, ps1-b3-eqiad was still missing the updated model, will be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/537696 [17:02:35] James_F, did you have time to look at my issue with beta cluster? [17:03:19] !log ganeti2004 - resetting DRAC in an attempt to make IPMI work again [17:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:21] raynor: Not yet, sorry. My next step was going to be setting it explicitly in VS. [17:03:36] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: set model for ps1-b3-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/537696 (https://phabricator.wikimedia.org/T233129) (owner: 10Filippo Giunchedi) [17:03:43] (03PS2) 10Filippo Giunchedi: facilities: set model for ps1-b3-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/537696 (https://phabricator.wikimedia.org/T233129) [17:03:43] it has to be all false in VS [17:04:08] that's the feature we still work on. 
I can create patch for VS and assign it to you [17:04:14] (03CR) 10Jgreen: [C: 03+2] DNS: Add mgmt and production DNS entires for frqueue2001 [dns] - 10https://gerrit.wikimedia.org/r/537693 (owner: 10Papaul) [17:04:34] Urbanecm: yup [17:05:09] okay Amir1 [17:06:34] (03PS16) 10Ladsgroup: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [17:07:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 59.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:08:32] (03PS2) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:08:34] (03PS1) 10Andrew Bogott: Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:08:36] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10fgiunchedi) While fixing phase check for new PDUs today I noticed tower B for ps1-a7-eqiad shows unknown while tower A is fine: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-a7-eqiad per... 
[17:09:20] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:09:48] (03CR) 10Volans: "LGTM, one small detail inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [17:10:01] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [17:10:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 84.53 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:10:55] (03Merged) 10jenkins-bot: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [17:11:00] yay!/ [17:11:13] (03CR) 10jenkins-bot: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [17:11:14] RECOVERY - ps1-b3-eqiad-infeed-load-tower-A-phase-Z on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-A-phase-Z 369 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:16] there will a bunch of recoveries for PDUs coming up, expected [17:11:19] yes, that [17:11:31] (03PS2) 10Andrew Bogott: Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:11:31] Urbanecm: it's not in wikidataclient.dblist, is it intentional? 
[17:11:33] (03PS3) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:11:36] RECOVERY - ps1-b3-eqiad-infeed-load-tower-A-phase-Y on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-A-phase-Y 418 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:50] RECOVERY - ps1-b3-eqiad-infeed-load-tower-A-phase-X on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-A-phase-X 472 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:52] no, I oversaw that :/ [17:11:54] RECOVERY - ps1-b3-eqiad-infeed-load-tower-B-phase-Z on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-B-phase-Z 354 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:01] I'll upload a patch [17:12:09] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:12:27] Urbanecm: Already on it [17:12:30] ok [17:12:59] (03PS1) 10Ladsgroup: Add hiwikisource to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537700 (https://phabricator.wikimedia.org/T218155) [17:13:05] https://wikitech.wikimedia.org/wiki/Add_a_wiki#Database_creation [17:13:22] (03CR) 10Ladsgroup: [C: 03+2] Add hiwikisource to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537700 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:13:29] Amir1: yup, have that opened :) [17:14:08] (03Merged) 10jenkins-bot: Add hiwikisource to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537700 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:14:10] I guess `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki hi wikisource hiwikisource hi.wikisource.org should work Amir1 ? 
[17:14:23] (03CR) 10jenkins-bot: Add hiwikisource to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537700 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:14:34] yeah but you should not deploy [17:14:47] first you need to pull the new thing into mwmaint1002 [17:14:58] I also want to make sure the fix is in addWiki.php [17:15:14] I guess that's because deploying would break the other sites, before the DB would be created? [17:18:10] yup [17:18:17] It would explode like nothing else [17:19:01] got it [17:20:02] okay, the database seems to be created [17:20:06] let me double check [17:20:47] okay, it's okay-ish [17:21:08] db1078 seems to have the DB [17:21:57] The ES things is the tricky part, We can't say for sure things are fixed until I get to see the main page on mwdebug1002 [17:22:09] now syncing the dblists [17:22:16] !log authdns-update to deploy DNS for new fundraising host [17:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:24] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 01m 06s) [17:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:44] okay Amir1 [17:23:58] (03PS1) 10Ladsgroup: Add hiwikisource to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537701 (https://phabricator.wikimedia.org/T218155) [17:24:24] (03CR) 10Ladsgroup: [C: 03+2] Add hiwikisource to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537701 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:24:38] RECOVERY - ps1-b3-eqiad-infeed-load-tower-B-phase-X on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-B-phase-X 410 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:24:38] Amir1: ^^was manual, or with some cmd? 
[17:24:48] manual [17:25:00] okay [17:25:03] One of the bullet points in https://wikitech.wikimedia.org/wiki/Add_a_wiki#MediaWiki_configuration [17:25:13] (03Merged) 10jenkins-bot: Add hiwikisource to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537701 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:26:22] now scap sync-wikiversions [17:26:29] (03CR) 10jenkins-bot: Add hiwikisource to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537701 (https://phabricator.wikimedia.org/T218155) (owner: 10Ladsgroup) [17:26:36] RECOVERY - ps1-b3-eqiad-infeed-load-tower-B-phase-Y on ps1-b3-eqiad is OK: SNMP OK - ps1-b3-eqiad-infeed-load-tower-B-phase-Y 421 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:27:16] okay Amir1 [17:28:02] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [17:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:54] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10RobH) p:05Triage→03Normal [17:28:56] pulled it in mwdebug1002. 
It fatals again [17:29:06] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10RobH) [17:29:20] Going to make myself admin, delete the page and Create it again [17:30:18] okay [17:33:32] !log mwscript maintenance/createAndPromote.php --wiki=hiwikisource --force --sysop Ladsgroup (T218155) [17:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:36] T218155: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 [17:35:06] I can't delete it :/ [17:35:48] Back to good old days of resetting the value manually on database [17:36:32] ok [17:38:27] (03PS1) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:38:40] !log manual write on hiwikisource "wikiadmin@10.64.0.205(hiwikisource)> update text set old_text = 'DB://cluster25/1';" (T218155) [17:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:43] T218155: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 [17:39:58] Amir1: so...the fix is not working? [17:40:06] or we just didn't know the full truth? [17:40:09] yup... [17:40:41] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add hiwikisource (T218155) (duration: 01m 04s) [17:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:48] which variant the yup belongs to Amir1 ? 
[17:41:55] good point [17:42:00] We need to update the docs [17:42:32] (03PS3) 10Andrew Bogott: Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:42:34] (03PS4) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:42:36] (03PS2) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:43:09] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: apt package config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:43:16] !log ladsgroup@deploy1001 Synchronized wmf-config/VariantSettings.php: Add hiwikisource (T218155) (duration: 01m 05s) [17:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:29] from my understanding, it seems previously, even creating ES database failed badly. Is that right, Amir1 ? 
[17:43:56] Yes but also it got fixed last time we tried in Wikimania [17:44:07] basically nothing has changed since Wikimania [17:45:12] :( [17:45:16] sad to hear [17:45:58] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: Add hiwikisource logos (T218155) (duration: 01m 04s) [17:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:01] T218155: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 [17:46:35] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537707 [17:46:36] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537707 (owner: 10Ladsgroup) [17:46:40] (03CR) 10Thiemo Kreuz (WMDE): Enable FileImport source wiki editing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537586 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [17:47:15] (03PS4) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:47:17] (03PS5) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:47:19] (03PS3) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:47:37] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537707 (owner: 10Ladsgroup) [17:47:43] Amir1: is there any way how I can help? 
[17:47:57] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:48:08] I don't think so :( [17:48:12] thanks for the offer though [17:48:31] Also thanks for making the initial config patch, it's a very tedious thing [17:48:47] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 32s) [17:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:52] you're welcome :) [17:49:28] okay, the interwiki is now working too [17:49:31] we are basically done [17:50:23] good! thanks Amir1 ! [17:50:51] !log decommissioning Cassandra, restbase2012-b -- T224553 [17:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [17:51:11] (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537707 (owner: 10Ladsgroup) [17:51:44] (03PS5) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:51:46] (03PS6) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:51:48] (03PS4) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:52:25] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:52:55] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ladsgroup) 05Open→03Resolved It should be okay now. 
For adding support to Wikidata, please open another ticket and add #wikidata. We will... [17:53:22] !log Creating hiwikisource is done [17:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:26] (03PS6) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:54:28] (03PS7) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:54:30] (03PS5) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:55:04] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:56:53] (03PS17) 10Cwhite: hiera: disable statsd_exporter::relay_address on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870) [17:57:36] (03PS7) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [17:57:38] (03PS8) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [17:57:40] (03PS6) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [17:58:04] * Amir1 goes back to his batcave [17:58:19] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [17:59:18] (03PS4) 10Herron: admin: remove marble and mmarble from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) [17:59:32] (03CR) 10Cwhite: "https://puppet-compiler.wmflabs.org/compiler1001/18405/" [puppet] - 10https://gerrit.wikimedia.org/r/537561 
(https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1800) [18:01:23] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Yann) Great! Thanks a lot to all who helped! [18:01:28] (03CR) 10Herron: [C: 03+2] admin: remove marble and mmarble from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/537654 (https://phabricator.wikimedia.org/T232353) (owner: 10Herron) [18:04:09] (03PS8) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [18:04:11] (03PS9) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [18:04:15] (03PS7) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:04:46] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [18:06:30] (03PS9) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 [18:06:32] (03PS10) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [18:06:34] (03PS8) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:07:13] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [18:08:47] (03PS10) 10Andrew Bogott: Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 
[18:08:49] (03PS11) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [18:08:51] (03PS9) 10Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - 10https://gerrit.wikimedia.org/r/537703 [18:09:38] (03CR) 10jerkins-bot: [V: 04-1] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [18:10:29] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:10:33] PROBLEM - Juniper alarms on cr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:20:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Krenair) >>! In T231616#5502047, @Nuria wrote: > Sorry this is disappointing but given our very limited resources we really cannot support ad-hoc data acces... [18:23:10] 04Critical Alert for device cr1-eqiad.wikimedia.org - Juniper alarm active [18:23:40] looking [18:24:09] we lost power [18:24:38] wiki_willy: anyone still doing work in eqiad? 
[18:25:32] cr1-eqiad> show system alarms [18:25:32] Alarm time Class Description [18:25:32] 2019-09-18 18:06:21 UTC Major PEM 0 Input Failure [18:25:32] 2019-09-18 18:06:21 UTC Major PEM 0 Not OK [18:25:32] asw2-a-eqiad> show system alarms [18:25:33] Alarm time Class Description [18:25:33] 2019-09-18 18:06:21 UTC Major FPC 1 PEM 0 is not powered [18:25:36] cmjohnson: and jclark-ctr [18:25:50] cmjohnson1: and jclark-ctr: ^ [18:26:22] Chris is on his way home [18:27:01] wiki_willy: it's possible that we only lost the top half of one A1 PDU [18:27:04] see https://netbox.wikimedia.org/dcim/racks/1/ [18:27:10] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [18:27:24] as the other servers didn't alert [18:27:33] did the power fail over to the redundant pdu? [18:27:58] the devices are not down, so yes [18:28:37] ok, cool....john usually starts in about an hour....are you ok with him checking it out then? or i can call remote hands if needed before then [18:28:57] wiki_willy: 1h is fine, can you open a task or should I? [18:29:11] go for it...you can include any specific details in there [18:29:18] ok [18:29:23] thanks [18:29:24] 10Operations, 10Cloud-Services, 10procurement, 10cloud-services-team (Kanban): ssl renewal: *.wmflabs.org expires 2019-11-16 - https://phabricator.wikimedia.org/T233176 (10Krenair) @Bstorm, it might need procurement if we've ruled out LE, but AFAIK that is still a perfectly valid option? 
[18:30:13] PROBLEM - IPMI Sensor Status on dns1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:30:42] wiki_willy: seems like it's a full PDU ^ the servers on the bottom half are alerting [18:31:14] ok, could be that the breaker tripped on that powerstrip [18:31:19] yeah [18:31:58] actually....i wonder if we should just swap it with a new one, since it's already down [18:32:17] 10Operations, 10ops-eqiad: Power issue in eqiad A1 - https://phabricator.wikimedia.org/T233248 (10ayounsi) p:05Triage→03High [18:32:24] wiki_willy: https://phabricator.wikimedia.org/T233248 [18:32:31] thanks arzhel [18:32:42] as long as the other one stays up :) [18:33:02] haha . i hear ya [18:33:19] probably better when 2 people are around to upgrade it....safer that way [18:34:12] ACKNOWLEDGEMENT - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T233248 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:34:12] ACKNOWLEDGEMENT - Juniper alarms on cr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms Ayounsi https://phabricator.wikimedia.org/T233248 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:34:12] ACKNOWLEDGEMENT - IPMI Sensor Status on dns1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Ayounsi https://phabricator.wikimedia.org/T233248 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:34:12] ACKNOWLEDGEMENT - IPMI Sensor Status on labsdb1009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] Ayounsi https://phabricator.wikimedia.org/T233248 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:35:10] 04Critical Alert for device cr1-eqiad.wikimedia.org - Juniper alarm active got acknowledged [18:35:19] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active got acknowledged [18:36:15] XioNoX: i just heard back from john...he's gonna swing by in a couple min to take a look [18:37:42] great, thanks [18:38:16] np, he was able to step away from his other job for a sec [18:45:47] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) Did anyone ping @MF-Warburg or @jhsoby and ask them to start copying content into this wiki? [18:46:48] !log remove `border-in4 term ddos-0906` from all routers [18:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:03] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Rschen7754) As a note for the future, the Main Page should remain as-is with the warning to wait to edit until importing has taken place. It s... [18:48:54] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1298.eqiad.wmnet ` The log can be found in `/var/log/wmf-aut... [18:50:41] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10MF-Warburg) I am aware of this. That's what we have the newprojects mailing list for. StevenJ81 wrote... 
[18:51:56] (03PS2) 10Dzahn: site: allocate mw1298 as a jobrunner, remove as spare [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) [18:52:01] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) OK. Thanks, MF-W. As far as I can tell, the Wiki wasn't even created with the customary warning page up front. But I'll let others... [18:52:44] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) >>! In T218155#5504251, @Rschen7754 wrote: > As a note for the future, the Main Page should remain as-is with the warning to wait to... [18:53:32] (03PS7) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) [18:54:06] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@5b011d1]: Regular deploy - analytics weekly train [18:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:35] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (201909), 10Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (10herron) Hey @hashar, zuul_2.5.1-wmf10 has been uploaded... 
[18:54:57] (03PS1) 10Dzahn: site: remove krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/537724 (https://phabricator.wikimedia.org/T231546) [18:55:11] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@5b011d1]: Regular deploy - analytics weekly train (duration: 01m 05s) [18:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:24] herron: lovely thank you VERU much :) [18:55:27] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10ayounsi) 05Resolved→03Open The Netbox/LibreNMS check is not happy: https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/ Did Netbox get updated with the new serial? [18:55:30] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10ayounsi) [18:55:31] english is crap sorry [18:55:43] hashar: np! [18:55:47] ACKNOWLEDGEMENT - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL Ayounsi https://phabricator.wikimedia.org/T227539 https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:59:17] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@5b011d1]: Regular deploy - analytics weekly train - Retry after fix [18:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1900). 
[19:01:29] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@5b011d1]: Regular deploy - analytics weekly train - Retry after fix (duration: 02m 12s) [19:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Nuria) >The idea that access for a formal collaboration with an external non-wikimedia group is an acceptable use of resources but access for wikimedians is... [19:03:54] (03PS2) 10Dzahn: site: remove krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/537724 (https://phabricator.wikimedia.org/T231546) [19:05:04] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) >>! In T218155#5504265, @StevenJ81 wrote: > OK. Thanks, MF-W. As far as I can tell, the Wiki wasn't even created with the customary... [19:06:09] 10Operations, 10observability: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10herron) p:05Triage→03Normal [19:06:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10observability, and 2 others: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (10herron) p:05Triage→03Normal [19:07:21] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: Problem with captcha at login on Wikitech: UnderflowException: Ran out of captcha images - https://phabricator.wikimedia.org/T233215 (10herron) p:05Triage→03Normal [19:07:40] !log There appear to be no blockers on T220748 so I'll proceed with deploying 1.34.0-wmf.23 to group 1. 
[19:07:40] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10herron) p:05Triage→03Normal [19:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:43] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [19:08:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:15] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (201909), 10Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (10herron) 05Open→03Resolved a:03herron [19:08:38] 10Operations, 10Wikimedia-Mailing-lists, 10Wikispore: Wikispore mailing list - https://phabricator.wikimedia.org/T232961 (10herron) p:05Triage→03Normal [19:09:13] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Openstack/Newton/Stretch: more config for keystone, designate, nova [puppet] - 10https://gerrit.wikimedia.org/r/537699 (owner: 10Andrew Bogott) [19:09:15] (03PS1) 1020after4: group1 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537730 [19:09:19] (03CR) 10Dzahn: [C: 03+2] site: remove krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/537724 (https://phabricator.wikimedia.org/T231546) (owner: 10Dzahn) [19:09:21] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537730 (owner: 1020after4) [19:09:29] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10herron) p:05Triage→03Normal [19:09:33] (03PS3) 10Dzahn: site: remove 
krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/537724 (https://phabricator.wikimedia.org/T231546) [19:10:14] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537730 (owner: 1020after4) [19:10:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:58] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537730 (owner: 1020after4) [19:13:19] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.23 refs T220748 [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:31] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [19:14:24] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.23 refs T220748 (duration: 01m 04s) [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:28] (03PS1) 10Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) [19:15:31] (03PS1) 10Dzahn: ATS/varnish: remove krypton backend [puppet] - 10https://gerrit.wikimedia.org/r/537733 (https://phabricator.wikimedia.org/T231546) [19:15:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Krenair) >>! In T231616#5504297, @Nuria wrote: > The analytics team does not run a platform intended for wide community access, to do so we will need many t... 
[19:17:15] (03PS2) 10Dzahn: ATS/varnish: remove krypton backend [puppet] - 10https://gerrit.wikimedia.org/r/537733 (https://phabricator.wikimedia.org/T231546) [19:19:36] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: Problem with captcha at login on Wikitech: UnderflowException: Ran out of captcha images - https://phabricator.wikimedia.org/T233215 (10Reedy) I don't understand how this is apparently only affecting Wikitech, which should use the... [19:19:55] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10herron) p:05Triage→03Normal [19:20:49] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10herron) p:05Triage→03Normal [19:21:10] (03CR) 10Dzahn: [C: 03+2] ATS/varnish: remove krypton backend [puppet] - 10https://gerrit.wikimedia.org/r/537733 (https://phabricator.wikimedia.org/T231546) (owner: 10Dzahn) [19:21:57] 10Operations, 10Puppet, 10netbox: postgres::slave module type for includes parameter in inconsistent. 
- https://phabricator.wikimedia.org/T232358 (10herron) p:05Triage→03Normal [19:23:24] (03PS1) 10Dzahn: remove krypton.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/537737 (https://phabricator.wikimedia.org/T231546) [19:23:40] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@bc9dde1]: Regular deploy - analytics weekly train - Second retry after fix [19:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:52] (03PS1) 10Jhedden: openstack: add codfw1dev keystone APIs to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/537738 (https://phabricator.wikimedia.org/T223907) [19:24:45] !log ganeti1001 - deleting krypton.eqiad.wmnet - decom T231546 [19:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:48] T231546: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 [19:25:04] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1298.eqiad.wmnet'] ` and were **ALL** successful. [19:25:19] jouncebot: now [19:25:19] For the next 1 hour(s) and 34 minute(s): MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1900) [19:26:23] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: Problem with captcha at login on Wikitech: UnderflowException: Ran out of captcha images - https://phabricator.wikimedia.org/T233215 (10Reedy) ConfirmEdit isn't enabled on wikitech according to https://wikitech.wikimedia.org/wiki/... [19:26:53] James_F: ^ Any chance ConfirmEdit possibly being enabled (for some users?) might be related to your refactoring? 
[19:27:20] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@bc9dde1]: Regular deploy - analytics weekly train - Second retry after fix (duration: 03m 40s) [19:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:19] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org, 10Wikimedia-production-error: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Reedy) [19:28:53] joal: If you view https://wikitech.wikimedia.org/wiki/Special:Version and Ctrl/Cmd + F, do you see ConfirmEdit enabled on wikitech? [19:30:30] (03PS2) 10Jhedden: openstack: add codfw1dev keystone APIs to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/537738 (https://phabricator.wikimedia.org/T223907) [19:31:08] Hi Reedy - I guess you'd prefer not to have the ConfirmEdit project associated with a wikitech issue? I did not :) But I can remove the tag if you want [19:31:23] It doesn't matter too much [19:31:33] ah ok :) [19:32:05] I have no "ConfirmEdit" in the version page I see, so no [19:32:09] Reedy: --^ [19:32:13] Cheers [19:33:36] Reedy: I will leave for tonight, anything else before I go? 
[19:33:48] I don't think so, thanks [19:33:59] thank you :) bye [19:34:49] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18414/" [puppet] - 10https://gerrit.wikimedia.org/r/537738 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [19:35:49] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:36:09] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:36:19] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [19:36:21] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:36:49] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:57] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:36:59] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:37:01] (03CR) 10Jeena Huneidi: [V: 03+2 C: 03+1] scaffold: Fix bug with concatenation of args/command (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 
(owner: 10Alexandros Kosiaris) [19:37:07] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:10] ^ once again NRPE killed because it runs out of memory because people do stuff [19:38:45] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:39:55] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org, 10Wikimedia-production-error: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Reedy) [19:40:34] https://phabricator.wikimedia.org/T212824 [19:42:48] !log T233095 Purging all pages on eswiki [19:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:51] T233095: Significant mobile web performance regression observed at deployment of 1.34.0-wmf.22 - https://phabricator.wikimedia.org/T233095 [19:42:57] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [19:43:13] gilles: how btw? [19:43:24] purgeList.php [19:44:41] gilles: ok, be careful not to use --purge. We should probably disable that in prod, super scary. 
[19:44:58] but varnish-only for main namespace should be fine indeed [19:46:15] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:23] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:46:25] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:46:37] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:46:51] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:47:32] there's a strange osm-intl-... file in /srv/mediawiki-staging on deploy1001. Why is that there? [19:47:55] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:48:14] Reedy: It's possible but unlikely that my config structural changes would have enabled ConfirmEdit on wikitechwiki. [19:48:36] It's the only change that's vaguely in the area that might've [19:49:56] Well, I've got even more planned. Not good if config is randomly changing. 
[19:50:54] Mmm [19:50:57] I can't reproduce it though [19:56:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [19:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:56:26] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org, 10Wikimedia-production-error: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Reedy) I cannot replicate (logging in is fine for me)... If anyone else can replicate,... [19:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:34] 10Operations, 10decommission, 10serviceops, 10Patch-For-Review: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `krypton.eqiad.wmnet` - krypton.eqiad.wmnet - Removed from Puppet master an... [19:56:57] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:57:06] !log decommissioning Cassandra, restbase2012-c -- T224553 [19:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:09] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [19:58:53] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T2000). 
[20:00:16] no parsoid deploy today [20:00:49] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:01:25] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10Varnent) From MuckRack: do you know when the DNS records were added? We don't see any of them returning as valid on our side just yet. Possible to share a screen shot of how those records were a... [20:03:24] (03PS12) 10Andrew Bogott: codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 [20:03:30] I'm grabbing the conch. [20:03:39] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org, 10Wikimedia-production-error: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Reedy) 05Open→03Stalled Marking stalled as above... No rush @JAllemandou, but can... 
[20:03:42] (03CR) 10Jforrester: [C: 03+2] Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [20:04:53] (03Merged) 10jenkins-bot: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [20:05:37] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move designate to 'Newton' [puppet] - 10https://gerrit.wikimedia.org/r/537692 (owner: 10Andrew Bogott) [20:06:41] (03CR) 10jenkins-bot: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [20:07:04] (03CR) 10Dzahn: [C: 03+2] remove krypton.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/537737 (https://phabricator.wikimedia.org/T231546) (owner: 10Dzahn) [20:07:09] (03PS2) 10Dzahn: remove krypton.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/537737 (https://phabricator.wikimedia.org/T231546) [20:07:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:07:52] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T208246 Enforce a 10-byte password for privileged users (duration: 01m 04s) [20:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:12] T208246: Change password length requirement and ensure enforcement for privileged users (from 8 to 10) - https://phabricator.wikimedia.org/T208246 [20:09:05] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 
[20:09:45] (03CR) 10Ayounsi: Initial support for custom scripts (035 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [20:09:48] (03PS1) 10Eevans: sessionstore: Upgrade image to 2019-09-18-090156-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/537747 (https://phabricator.wikimedia.org/T229697) [20:10:54] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Upgrade image to 2019-09-18-090156-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/537747 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [20:11:38] (03PS9) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [20:11:46] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:11:56] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [20:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:43] (03Merged) 10jenkins-bot: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:12:59] (03CR) 10jenkins-bot: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:13:31] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Wed 2019-09-18 20:13:30 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [20:14:31] 10Operations, 10decommission, 10serviceops, 10Patch-For-Review: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (10Dzahn) [20:14:41] 10Operations, 10decommission, 10serviceops, 10Patch-For-Review: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (10Dzahn) 05Open→03Resolved [20:14:44] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [20:15:23] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [20:15:26] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Variant configuration: Never write to serialised PHP T223602 (duration: 01m 04s) [20:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:41] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [20:15:44] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) krypton has been decom'ed fully [20:15:58] jouncebot: now [20:15:58] For the next 0 hour(s) and 44 minute(s): MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T1900) [20:15:58] For the next 0 hour(s) and 44 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T2000) [20:17:10] Urbanecm: I still have the conch. [20:17:23] Is there an unbreak-now? 
[20:17:38] (03PS3) 10Jforrester: CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537218 [20:18:03] James_F: no, it can wait till tomorrow :) [20:18:04] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Variant configuration: Drop suport for serialised PHP (duration: 01m 04s) [20:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:14] Kk. [20:18:24] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537218 (owner: 10Jforrester) [20:18:40] Urbanecm: This one I'm landing now should get rid of the duplicate function warnings you're seeing. [20:18:52] cool, thanks James_F ! [20:18:58] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [20:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:12] (03Merged) 10jenkins-bot: CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537218 (owner: 10Jforrester) [20:19:28] (03CR) 10jenkins-bot: CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537218 (owner: 10Jforrester) [20:20:36] (03PS3) 10Herron: mediawiki: Use HTTPS for /nl-portal and /be-portal redirects [puppet] - 10https://gerrit.wikimedia.org/r/518099 (owner: 10Krinkle) [20:21:49] (03PS2) 10Jforrester: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 [20:22:40] 10Operations, 10Cloud-Services, 10procurement, 10cloud-services-team (Kanban): ssl renewal: *.wmflabs.org expires 2019-11-16 - https://phabricator.wikimedia.org/T233176 (10bd808) >>! 
In T233176#5504187, @Krenair wrote: > @Bstorm, it might need procurement if we've ruled out LE, but AFAIK that is still a pe... [20:22:56] (03CR) 10Herron: [C: 03+2] mediawiki: Use HTTPS for /nl-portal and /be-portal redirects [puppet] - 10https://gerrit.wikimedia.org/r/518099 (owner: 10Krinkle) [20:23:02] (03CR) 10Jforrester: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 (owner: 10Jforrester) [20:23:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out call to InitialiseSettings.php (duration: 01m 04s) [20:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:44] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) This may be due to {T222099} [20:30:53] (03PS1) 10Mholloway: RESTBase: Configure wikifeeds_uri [puppet] - 10https://gerrit.wikimedia.org/r/537750 (https://phabricator.wikimedia.org/T170455) [20:32:18] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10herron) DNS records were added Monday at approx. 10:30am Eastern. Here is what I'm seeing in terms of responses from live DNS: ` $ host -t MX pr.wikimedia.org pr.wikimedia.org mail is handled by... [20:34:38] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) 05Open→03Resolved From Level3: > I appreciate your patience while we worked on gathering the data on these repair tickets. I’ve attached the repair ticket log abov... 
[20:36:33] (03PS1) 10Jforrester: tests: Migrate LoggingTest to avoid WgConfTestCase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537751 [20:37:39] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) The "classic" main page is in place. [20:37:46] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 (owner: 10Jforrester) [20:37:56] foks: Oops, that wasn't meant to happen. Fixing now. [20:38:18] :) [20:38:41] 10Operations, 10Cloud-Services, 10procurement, 10cloud-services-team (Kanban): ssl renewal: *.wmflabs.org expires 2019-11-16 - https://phabricator.wikimedia.org/T233176 (10Krenair) >>! In T233176#5504563, @bd808 wrote: >>>! In T233176#5504187, @Krenair wrote: >> @Bstorm, it might need procurement if we've... [20:38:48] (03Merged) 10jenkins-bot: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 (owner: 10Jforrester) [20:39:03] (03CR) 10jenkins-bot: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 (owner: 10Jforrester) [20:40:35] !log jforrester@deploy1001 scap failed: average error rate on 9/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [20:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:42] Well, that's not good. [20:43:15] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Quick fix for wmfLoadInitialiseSettings() (duration: 01m 03s) [20:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:30] foks: OK, *now* it should be fixed. 
[20:43:53] (03PS2) 10Jforrester: tests: Migrate LoggingTest to avoid WgConfTestCase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537751 [20:43:55] (03PS1) 10Jforrester: wmfLoadInitialiseSettings: Declare wmfRealm as a global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537753 [20:44:47] (03CR) 10Jforrester: [C: 03+2] wmfLoadInitialiseSettings: Declare wmfRealm as a global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537753 (owner: 10Jforrester) [20:45:36] (03Merged) 10jenkins-bot: wmfLoadInitialiseSettings: Declare wmfRealm as a global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537753 (owner: 10Jforrester) [20:46:26] (03CR) 10jenkins-bot: wmfLoadInitialiseSettings: Declare wmfRealm as a global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537753 (owner: 10Jforrester) [20:46:36] 10Operations, 10observability: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10ori) >>! In T233047#5497988, @Joe wrote: > @ori I'm not 100% sure I got what information you think would be useful to extract. At first glance it would seem like collecting those data in a structured mann... 
[20:50:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: No longer load InitialiseSettings at all in CommonSettings (duration: 01m 03s) [20:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:05] (03CR) 10Jforrester: [C: 03+2] tests: Migrate LoggingTest to avoid WgConfTestCase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537751 (owner: 10Jforrester) [20:52:51] (03Merged) 10jenkins-bot: tests: Migrate LoggingTest to avoid WgConfTestCase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537751 (owner: 10Jforrester) [20:53:07] (03CR) 10jenkins-bot: tests: Migrate LoggingTest to avoid WgConfTestCase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537751 (owner: 10Jforrester) [20:56:27] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:56:35] PROBLEM - Host ps1-a1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [20:59:02] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) phab1002 is already gone from repo since a while. This host is called mw1298 and i just reinstalled it again just in case. What is there really left to do here? 
[21:01:09] (03PS3) 10Jhedden: openstack: add codfw1dev keystone APIs to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/537738 (https://phabricator.wikimedia.org/T223907) [21:03:38] (03PS1) 10Bstorm: toolforge-k8s: proposed role for all tools [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) [21:04:38] (03PS2) 10Jforrester: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 [21:04:40] (03PS1) 10Jforrester: tests: Skip the Cirrus configuration tests as they're inextricable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537756 [21:05:02] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) [21:05:41] (03CR) 10Jhedden: [C: 03+2] openstack: add codfw1dev keystone APIs to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/537738 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [21:06:15] (03CR) 10Jforrester: [C: 04-1] "I'd really rather not do this. :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537756 (owner: 10Jforrester) [21:06:32] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) Looks to me no other steps are needed for decom. Besides maybe checking the switch port description and physical label. Next we can apply a mediawiki role (jo... 
[21:06:59] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:07:43] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:07:59] RECOVERY - Juniper alarms on cr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:09:20] !log enable damping on codfw-ulsfo link - T196432 [21:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:24] T196432: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 [21:13:40] !log enable damping on primary codfw-eqiad link - T196432 [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:57] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) Thanks @WMDE-leszek. ! I think we can close this now @BBlack @Vgutierrez I don't see more leftovers to clean up. [21:16:07] 10Operations, 10Traffic, 10netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) 05Open→03Resolved All primary link of all transport pairs have now damping configured. 
[21:16:17] (03CR) 10Cwhite: [C: 03+2] profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:19:57] RECOVERY - Host ps1-a1-eqiad is UP: PING WARNING - Packet loss = 64%, RTA = 1.58 ms [21:23:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Juniper alarm active [21:23:19] PROBLEM - ps1-a1-eqiad-infeed-load-tower-B-phase-Y on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:23] PROBLEM - ps1-a1-eqiad-infeed-load-tower-B-phase-X on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:30] (03PS6) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [21:23:51] PROBLEM - ps1-a1-eqiad-infeed-load-tower-A-phase-Z on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:15] PROBLEM - ps1-a1-eqiad-infeed-load-tower-B-phase-Z on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:15] PROBLEM - ps1-a1-eqiad-infeed-load-tower-A-phase-Y on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:21] PROBLEM - ps1-a1-eqiad-infeed-load-tower-A-phase-X on ps1-a1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:28] 10Operations: Track remaining jessie systems in 
production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [21:25:26] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) restbase2012 is decommissioned and can be reimaged at any time. [21:27:05] (03CR) 10Cwhite: [C: 03+2] profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:27:13] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [21:32:53] RECOVERY - IPMI Sensor Status on dns1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:33:18] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [21:34:19] (03PS1) 10Dzahn: racktables: set db host in Hiera, set to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) [21:39:55] (03PS1) 10Dzahn: iegreview: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537762 (https://phabricator.wikimedia.org/T224247) [21:40:57] (03PS2) 10Dzahn: racktables: set db host in Hiera, set to eqiad, use lookup [puppet] - 10https://gerrit.wikimedia.org/r/537761 (https://phabricator.wikimedia.org/T224247) [21:43:48] (03PS1) 10Dzahn: wikimania_scholarships: set db host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/537763 (https://phabricator.wikimedia.org/T224247) [21:49:38] 10Operations, 10vm-requests: Site: 2 VMs for puppetdb - https://phabricator.wikimedia.org/T230609 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff VMs have been created. 
[21:53:28] !log gilles@deploy1001 Synchronized php-1.34.0-wmf.22/maintenance/purgeList.php: T233095 Make purgeList.php use getCdnUrls() (duration: 01m 04s) [21:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:32] T233095: Significant mobile web performance regression observed at deployment of 1.34.0-wmf.22 - https://phabricator.wikimedia.org/T233095 [21:53:38] (03PS7) 10Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [21:54:52] !log T233095 Purging all eswiki articles (both desktop and mobile this time) [21:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson) [21:58:40] (03PS8) 10Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [22:00:15] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10Jclark-ctr) @jcrespo all Disk show connection light can you please reverify [22:00:56] (03PS3) 10Dzahn: site: allocate mw1298 as a jobrunner, add to conftool [puppet] - 10https://gerrit.wikimedia.org/r/537658 (https://phabricator.wikimedia.org/T192457) [22:04:23] gilles: deployment clear? 
[22:04:34] yes [22:04:55] OK, doing https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/537702/ next [22:13:38] !log disabling asw2-c-eqiad xe-2/0/45 - cr1-eqiad to replace optic T233265 [22:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:41] T233265: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 [22:17:19] (03PS2) 10Bstorm: toolforge-kubernetes: restructure pod security policies [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) [22:18:02] (03PS2) 10Bstorm: toolforge-k8s: proposed role for all tools [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) [22:23:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:29:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:30:42] HHVM/200/POST and PHP7+HHVM/200/GET both appear to have doubled in latencies [22:31:22] Since when? [22:31:45] 22:21:00 UTC? [22:31:58] Nothing in the SAL. [22:32:39] indeed [22:32:44] probably user induced [22:32:54] Some massive bot run? 
[22:33:44] looks like I might have a lucky strike of three composer fetches succeeding in a row, which is what I need for the job to pass [22:33:51] \o/ [22:34:24] * James_F grins. [22:34:52] * Krinkle staging on mwdebug1002 [22:37:14] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/AbuseFilter/includes/: T156095, ff44043efa59e9 (duration: 01m 05s) [22:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:18] T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe - https://phabricator.wikimedia.org/T156095 [22:38:33] PROBLEM - Host ms-be1027 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:54] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.23/resources/Resources.php: d6dadfdb0b237c918 (duration: 01m 03s) [22:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:06] next up - https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/537765/ [22:41:35] Jdlrobson: I can do yours next [22:42:01] prepare to wait a while though, per T233264 [22:42:04] T233264: Jenkins jobs failing with Composer TransportException: 404 Not Found (September 2019) - https://phabricator.wikimedia.org/T233264 [22:58:58] !log enabled asw2-c-eqiad interface xe-2/0/45 [22:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190918T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:03:39] I've given Jdlrobson a +2 a few times already given the composer issues [23:03:45] looks like 3rd time it might work [23:05:55] thanks Krinkle [23:09:09] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1566659363[6](2019-09-15T13:39:44.466Z), enwiki_content_1546970425[3](2019-09-15T13:39:54.892Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:12:22] hurrah [23:13:53] Jdlrobson: staged on mwdebug1002, please test/verify ) [23:14:31] on it [23:14:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson) swapped both optics on cr1-eqiad and asw2-c xe-2/0/45. Giving it 24 hours to see if any errors return [23:15:10] 10Operations, 10ops-eqiad: Power issue in eqiad A1 - https://phabricator.wikimedia.org/T233248 (10Cmjohnson) John replaced side A pdu with a new PDU. [23:15:31] Krinkle: checked. JS error gone [23:15:33] sync away [23:16:09] 10Operations, 10ops-eqiad: Power issue in eqiad A1 - https://phabricator.wikimedia.org/T233248 (10Cmjohnson) 05Open→03Resolved resolving this task [23:16:51] Jdlrobson: the src/ is not used, so it's okay if I sync dist/ and src/ separately, right? 
[23:17:02] yep that's fine [23:17:02] (as opposed to the whole repo, though can do that too) [23:17:06] OK [23:17:13] the dist is all that matters for the bug [23:18:33] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/MobileFrontend/resources/dist/: T233260, 1667ed957a19067 (duration: 01m 04s) [23:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:37] T233260: Regression: `Cannot read property 'route' of undefined` JS error on all mobile main page - https://phabricator.wikimedia.org/T233260 [23:25:57] (03PS2) 10Nuria: Adding config for friendly values on netflow dataset [puppet] - 10https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682) [23:32:35] (03PS13) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [23:33:25] (03CR) 10jerkins-bot: [V: 04-1] ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:38:12] (03CR) 10Cwhite: "It may be worth combining I27b3c86fbeb266f76a0f32c0b352937b240ce633 with this changeset. What do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) (owner: 10Filippo Giunchedi) [23:39:51] (03CR) 10Cwhite: "Metrics coverage: Matched 10290/10316; Percent Complete 99%." [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)