[00:04:23] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wi
[00:04:23] Services/Monitoring/mobileapps
[00:06:40] Operations, ops-eqiad, vm-requests, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Cmjohnson) @Jclark-ctr can you fix the mgmt password for these please.
[00:07:41] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:09:43] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:35] (PS1) CDanis: grafana: remove rsyncs after complete migration [puppet] - https://gerrit.wikimedia.org/r/552933 (https://phabricator.wikimedia.org/T220838)
[00:35:05] (CR) Dzahn: [C: +1] "discussed in meeting and needed to resolve https://phabricator.wikimedia.org/T238905" [puppet] - https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[00:36:55] (CR) CDanis: [C: +2] grafana: remove rsyncs after complete migration [puppet] - https://gerrit.wikimedia.org/r/552933 (https://phabricator.wikimedia.org/T220838) (owner: CDanis)
[00:38:42] Operations, Discovery-Search, SRE-Access-Requests, Patch-For-Review: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (Dzahn) The group has been created and request was approved in SRE meeting.
It needs merge of https://gerrit.wikimedia...
[00:40:03] (PS1) CDanis: grafana: first, ensure=>absent the obsolete rsyncs [puppet] - https://gerrit.wikimedia.org/r/552934 (https://phabricator.wikimedia.org/T220838)
[00:43:20] (CR) CDanis: [C: +2] grafana: first, ensure=>absent the obsolete rsyncs [puppet] - https://gerrit.wikimedia.org/r/552934 (https://phabricator.wikimedia.org/T220838) (owner: CDanis)
[00:44:26] (PS1) BBlack: Add and use check_dns_query [puppet] - https://gerrit.wikimedia.org/r/552935 (https://phabricator.wikimedia.org/T98006)
[00:55:05] (PS3) Dzahn: Allow analytics-search-users to manage search/airflow instances [puppet] - https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[00:56:13] (PS1) CDanis: grafana: okay, *now* remove the obsolete rsyncs [puppet] - https://gerrit.wikimedia.org/r/552936 (https://phabricator.wikimedia.org/T220838)
[01:00:24] (CR) CDanis: [C: +2] grafana: okay, *now* remove the obsolete rsyncs [puppet] - https://gerrit.wikimedia.org/r/552936 (https://phabricator.wikimedia.org/T220838) (owner: CDanis)
[01:15:51] (CR) Dzahn: [C: +2] Allow analytics-search-users to manage search/airflow instances [puppet] - https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[01:18:25] (CR) CRusnov: "> Patch Set 5:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[01:18:41] (CR) Dzahn: "..i should have edited the commit message to say "airflow-search-admins" since that's what this is now."
[puppet] - https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[01:22:59] Operations, Discovery-Search, SRE-Access-Requests, Patch-For-Review: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (Dzahn) ` [an-airflow1001:~] $ id ebernhardson uid=3088(ebernhardson) gid=500(wikidev) groups=500(wikidev),816(airflow...
[01:23:41] (CR) CRusnov: [C: +1] "> Patch Set 1: Verified-1" [software/netbox-reports] - https://gerrit.wikimedia.org/r/552919 (owner: Faidon Liambotis)
[01:23:51] Operations, Discovery-Search, SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (Dzahn) Open→Resolved
[01:25:42] Operations, Discovery-Search, SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (Dzahn) I merged the change by @EBernhardson which added the new group on `an-airflow1001`. Puppet has created the users and sudo privileges...
[01:36:01] Operations, DNS, Traffic: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (Dzahn) Really wildcard or more like "populate DNS (langlist.tmpl) with all language codes from [[ https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab | ISO-693-3...
[01:49:24] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/552521 (https://phabricator.wikimedia.org/T187708) (owner: Filippo Giunchedi)
[02:18:57] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1035.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:33:48] (CR) Dzahn: [C: +1] "personally i'm in the "whatever" camp in this one but i see you have a lot of phab tokens ;)" [puppet] - https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: Aklapper)
[02:35:17] (CR) Dzahn: [C: +1] "survey during tech conf: whatever: 18 yes: 12 no: 1" [puppet] - https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: Aklapper)
[03:06:27] (PS3) Dzahn: mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411)
[03:11:06] (PS4) Dzahn: mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411)
[03:14:04] (CR) Dzahn: "This is following the same scheme used for a bunch of other services on the linked ticket. noc.wm.org is one of the few remaining misc one" [puppet] - https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[03:15:29] (CR) Dzahn: "cert already in private repo since a while ago. in DNS i used "maintenance" first but changing it to "mwmaint" for consistency."
[puppet] - https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[03:18:49] (PS1) Dzahn: rename maintenance.discovery to mwmaint.discovery [dns] - https://gerrit.wikimedia.org/r/552944
[03:21:47] (PS2) Dzahn: rename maintenance.discovery to mwmaint.discovery [dns] - https://gerrit.wikimedia.org/r/552944 (https://phabricator.wikimedia.org/T210411)
[03:25:51] (PS1) Dzahn: otrs: add envoy for TLS termination behind ATS [puppet] - https://gerrit.wikimedia.org/r/552947
[03:43:27] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:04:03] Operations, Traffic, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (Vgutierrez) I've debugged locally what we seen yesterday on production with the following lua script: `lang=lua WEBSOCKET_SUPPORT = nil function __init__(argtb) dofil...
[04:14:30] Operations, Traffic, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (BBlack) @Vgutierrez - I really think, reading the Lua plugin code, that `__reload__` in 8.0.x might not do what you'd sanely expect (although it is undocumented). I thin...
[04:18:17] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops
[05:08:21] !log Start pre-steps for s7 failover - T238044
[05:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:27] T238044: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044
[05:10:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set weight 0 to db1086 as it will be the new s7 master - T238044', diff saved to https://phabricator.wikimedia.org/P9741 and previous config saved to /var/cache/conftool/dbconfig/20191126-051034-marostegui.json
[05:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:52] (PS4) Marostegui: mariadb: Promote db1086 to s7 primary master [puppet] - https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044)
[05:14:02] (PS4) Marostegui: wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044)
[05:25:33] (CR) Marostegui: [C: +2] mariadb: Promote db1086 to s7 primary master [puppet] - https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) (owner: Marostegui)
[05:28:07] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.8 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:33:17] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76.54 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:38:20] (PS1) Ayounsi: rename cr2-knams to cr3, bundle knams-esams links [homer/public] - https://gerrit.wikimedia.org/r/552951
[05:45:47] (PS3) Ammarpad: Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091)
[05:49:01] !log Deploy schema change on dbstore1003:3311
[05:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:19] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:07] (PS1) Ayounsi: Rename cr2-knams to cr3-knams [dns] - https://gerrit.wikimedia.org/r/552952 (https://phabricator.wikimedia.org/T237030)
[05:55:54] (PS1) Ayounsi: Rename cr2-knams to cr3-knams [puppet] - https://gerrit.wikimedia.org/r/552953 (https://phabricator.wikimedia.org/T237030)
[06:00:04] marostegui and jynus: How many deployers does it take to do s7 database master failover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T0600).
[06:00:08] jynus: ready?
[06:00:10] yes
[06:00:14] !log Starting s7 failover from db1062 to db1086 - T238044
[06:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:19] T238044: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044
[06:00:24] !log marostegui@cumin2001 dbctl commit (dc=all): 'Set s7 as read-only for maintenance T238044', diff saved to https://phabricator.wikimedia.org/P9742 and previous config saved to /var/cache/conftool/dbconfig/20191126-060023-marostegui.json
[06:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:39] ro confirmed
[06:00:42] same
[06:01:09] !log marostegui@cumin2001 dbctl commit (dc=all): 'Promote db1086 on s7 master and remove read-only from s7 T238044', diff saved to https://phabricator.wikimedia.org/P9743 and previous config saved to /var/cache/conftool/dbconfig/20191126-060108-marostegui.json
[06:01:12] topology done
[06:01:12] RO off
[06:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:24] I can edit
[06:01:25] I can edit
[06:01:36] excellent, going to monitor errors
[06:03:21] (CR) Marostegui: [C: +2] wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) (owner: Marostegui)
[06:13:55] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:38] (PS1) Ema: Revert "Revert "ATS: enable reload for global Lua script"" [puppet] - https://gerrit.wikimedia.org/r/552955
[06:19:34] Operations, DBA,
Patch-For-Review, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui)
[06:30:04] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[06:30:09] Operations, DBA, Patch-For-Review, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui) Open→Resolved This was done successfully. Read only starts: 06:00:24 Read only stops: 06:01:09 Tot...
[06:31:02] (PS2) Ema: ATS: enable reload for global Lua script [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274)
[06:33:52] marostegui: I have a router to rename, it might be a bit noisy, but no issues expected, are you done with your maintenance?
[06:34:04] XioNoX: yeah, all done, thanks :)
[06:34:12] (CR) jerkins-bot: [V: -1] ATS: enable reload for global Lua script [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: Ema)
[06:34:16] cool!
[06:34:30] !log Rename cr2-knams to cr3-knams - T237030
[06:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:36] T237030: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030
[06:34:56] (CR) Ayounsi: [C: +2] Rename cr2-knams to cr3-knams [dns] - https://gerrit.wikimedia.org/r/552952 (https://phabricator.wikimedia.org/T237030) (owner: Ayounsi)
[06:35:01] (PS2) Ayounsi: Rename cr2-knams to cr3-knams [dns] - https://gerrit.wikimedia.org/r/552952 (https://phabricator.wikimedia.org/T237030)
[06:35:32] (PS1) Marostegui: db1136: Make db1136 candidate master for s7 [puppet] - https://gerrit.wikimedia.org/r/552957 (https://phabricator.wikimedia.org/T238044)
[06:37:22] (CR) Ayounsi: [C: +2] Rename cr2-knams to cr3-knams [puppet] - https://gerrit.wikimedia.org/r/552953 (https://phabricator.wikimedia.org/T237030) (owner: Ayounsi)
[06:37:24] (CR) Marostegui: [C: +2] db1136: Make db1136 candidate master for s7 [puppet] - https://gerrit.wikimedia.org/r/552957 (https://phabricator.wikimedia.org/T238044) (owner: Marostegui)
[06:37:52] XioNoX: ok to merge your change?
[06:38:05] marostegui: yep
[06:38:12] XioNoX: done
[06:38:15] thx
[06:42:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 from vslow, and pool db1092 temporarily as vslow,dump for s8, for a schema change on db1087', diff saved to https://phabricator.wikimedia.org/P9744 and previous config saved to /var/cache/conftool/dbconfig/20191126-064200-marostegui.json
[06:42:02] Operations, Parsoid-PHP, serviceops, Patch-For-Review, Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (jcrespo) This is ongoing, so adding production error tag: ` PHP Fatal error: Allowed memory size of 692060160 bytes exhausted (trie...
[06:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:04] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:43:45] !log Deploy schema change on db1087 with replication, lag will be generated on s8 for labsdb hosts
[06:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:05] !log Remove triggers for ar_comment on db1124:3318 T234704
[06:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:10] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[06:44:29] Operations, ops-esams, netops, Patch-For-Review: Setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (ayounsi) Open→Resolved a:ayounsi All done.
[06:44:31] Operations, ops-esams, DC-Ops, Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (ayounsi)
[06:45:50] (CR) Elukey: "Thanks a lot Daniel!" [puppet] - https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[06:46:10] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:47:42] Operations, Traffic, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (ema) >>! In T233274#5691926, @Vgutierrez wrote: > I've debugged locally what we seen yesterday on production with the following lua script: > `lang=lua > WEBSOCKET_SUP...
[06:49:31] (CR) Elukey: airflow: Run webserver and scheduler processes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[06:51:17] !log Run compare.py for db2125 - T239042
[06:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:22] T239042: db2125 crashed - https://phabricator.wikimedia.org/T239042
[06:53:30] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[06:54:12] Operations, ops-eqiad, Analytics, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (elukey) @RobH my bad! Thanks a lot for the patience @Cmjohnson, I'll add more pictures to the blog post when it will be allowed to be publ...
[06:57:12] PROBLEM - Juniper alarms on cr3-knams is CRITICAL: JNX_ALARMS CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[06:57:22] (CR) Elukey: [C: +1] "I am a bit ignorant with TLS so I'll state the obvious: it seems to me that no current TLS certificate (signed by the current CA's private" [puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: Jbond)
[06:58:29] (PS1) Marostegui: mariadb: Set db1061 to spare [puppet] - https://gerrit.wikimedia.org/r/552960 (https://phabricator.wikimedia.org/T238624)
[06:59:12] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1061 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/552961 (https://phabricator.wikimedia.org/T238624)
[07:00:04] (CR) Elukey: [C: +1] create analytics-web.discovery.wmnet, point to thorium [dns] - https://gerrit.wikimedia.org/r/551938 (owner: Dzahn)
[07:00:48] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1061 from config [mediawiki-config] -
https://gerrit.wikimedia.org/r/552961 (https://phabricator.wikimedia.org/T238624) (owner: Marostegui)
[07:00:56] (CR) Elukey: [C: +1] ATS/varnish: rename thorium director to analytics-web [puppet] - https://gerrit.wikimedia.org/r/551939 (owner: Dzahn)
[07:01:15] (CR) Marostegui: [C: +2] mariadb: Set db1061 to spare [puppet] - https://gerrit.wikimedia.org/r/552960 (https://phabricator.wikimedia.org/T238624) (owner: Marostegui)
[07:01:37] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1061 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/552961 (https://phabricator.wikimedia.org/T238624) (owner: Marostegui)
[07:03:48] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1061 from config T238624 (duration: 00m 54s)
[07:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:53] T238624: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624
[07:04:03] Operations, Traffic, Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (ema) Tried this on a test instance instead: `lang=lua function read_config() local confffile = ts.get_config_dir() .. "/default.lua.conf" ts.error("Load...
[07:04:53] (PS1) Marostegui: db1136: Add "s7" [puppet] - https://gerrit.wikimedia.org/r/552962
[07:05:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1061 from config T238624 (duration: 00m 52s)
[07:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:45] (CR) Marostegui: [C: +2] db1136: Add "s7" [puppet] - https://gerrit.wikimedia.org/r/552962 (owner: Marostegui)
[07:09:17] !log Stop MySQL on db1061 - T238624
[07:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:22] T238624: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624
[07:12:30] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:12:49] (PS1) Marostegui: db1062: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552963 (https://phabricator.wikimedia.org/T239188)
[07:14:36] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:14:55] (CR) Marostegui: [C: +2] db1062: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/552963 (https://phabricator.wikimedia.org/T239188) (owner: Marostegui)
[07:15:50] the BFD down seems related to knams
[07:16:39] yep Transport: cr2-knams:xe-1/1/0.13
[07:16:48] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:17:02] ^ me
[07:17:37] ack
[07:17:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1061 from config - T238624', diff saved to https://phabricator.wikimedia.org/P9745 and previous config saved to
/var/cache/conftool/dbconfig/20191126-071746-marostegui.json
[07:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:53] T238624: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624
[07:18:06] (PS3) Ema: ATS: enable reload for global Lua script [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274)
[07:19:38] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:21:09] (CR) jerkins-bot: [V: -1] ATS: enable reload for global Lua script [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: Ema)
[07:21:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@378f504] (dev-cluster): Do not use duplicate filter definitions
[07:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:50] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:28:49] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@378f504] (dev-cluster): Do not use duplicate filter definitions (duration: 07m 36s)
[07:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:26] Operations, ops-eqiad, Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (elukey) >>! In T239045#5691484, @Jclark-ctr wrote: > @elukey No spare bbu around @Jclark-ctr hi! In https://phabricator.wikimedia.org/T233080 analytics1032 needs to be decommed, maybe we can...
[07:29:34] !log mobrovac@deploy1001 Started deploy [restbase/deploy@378f504]: Do not use duplicate filter definitions T234266
[07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:41] T234266: Cannot read property 'stored' of undefined - https://phabricator.wikimedia.org/T234266
[07:29:48] (PS1) Mobrovac: Parsoid: Switch mw.org to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552965 (https://phabricator.wikimedia.org/T229015)
[07:36:27] Operations, ops-eqiad, DC-Ops: Duplicate cable label in cr1-eqiad/cr2-eqiad - https://phabricator.wikimedia.org/T239098 (ayounsi) Open→Resolved Updated.
[07:43:58] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@378f504]: Do not use duplicate filter definitions T234266 (duration: 14m 24s)
[07:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:04] T234266: Cannot read property 'stored' of undefined - https://phabricator.wikimedia.org/T234266
[07:50:37] (CR) Mobrovac: [C: +2] Parsoid: Switch mw.org to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552965 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac)
[07:51:49] (Merged) jenkins-bot: Parsoid: Switch mw.org to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552965 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac)
[07:53:04] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Parsoid: Switch Flow to Parsoid/PHP on mw.org -- T229015 (duration: 00m 52s)
[07:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:09] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015
[08:01:51] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:04:37] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.75 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:05:49] (CR) Vgutierrez: ATS: enable reload for global Lua script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: Ema)
[08:09:01] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.73 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:13:05] (PS1) Marostegui: db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/552971 (https://phabricator.wikimedia.org/T239042)
[08:13:21] (CR) Marostegui: [C: -2] "Not yet ready" [puppet] - https://gerrit.wikimedia.org/r/552971 (https://phabricator.wikimedia.org/T239042) (owner: Marostegui)
[08:13:43] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:14:22] (PS1) Mobrovac: Parsoid: Switch groups 0 and 1 to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552972 (https://phabricator.wikimedia.org/T229015)
[08:18:45] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 82.36 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:43:57] PROBLEM - Router interfaces on cr3-knams
is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:15] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:10:21] ^ me
[09:10:25] will be pushing in a few minutes
[09:11:02] (CR) Filippo Giunchedi: [C: -1] "LGTM overall, we'll need to tweak Kafka consumer groups to avoid overlaps with current consumer groups" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:12:21] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:14:42] (CR) Filippo Giunchedi: [C: +2] prometheus: use max5m for node_ipvs gauges [puppet] - https://gerrit.wikimedia.org/r/552810 (https://phabricator.wikimedia.org/T236700) (owner: Filippo Giunchedi)
[09:18:36] !log Run maintain-views for wikidatawiki.protected_title view on labsdb hosts T233135
[09:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:42] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[09:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 into s8 vslow,dump', diff saved to https://phabricator.wikimedia.org/P9748 and previous config saved to /var/cache/conftool/dbconfig/20191126-092409-marostegui.json
[09:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:07] (PS1) Giuseppe Lavagetto: envoy-tls: proxy the admin interface too.
[docker-images/production-images] - https://gerrit.wikimedia.org/r/553045
[09:26:50] !log Deploy schema change on s8 primary master (db1109) - T234066 T233135 T237120
[09:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:57] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[09:26:58] T237120: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120
[09:26:58] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[09:27:14] Operations, Traffic, Wikidata, Wikidata-Campsite, User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Addshore)
[09:27:25] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:27:36] Operations, Traffic, Wikidata, Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (Addshore) Open→Resolved a:Ladsgroup I'll close this one along with the subtask then :)
[09:27:38] Operations, Traffic: ats-be on the text cluster is experiencing broken connections - https://phabricator.wikimedia.org/T236988 (Addshore)
[09:29:13] Operations, MediaWiki-JobQueue, Wikidata, Performance-Team (Radar), and 3 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 (Addshore)
[09:29:29] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:31:25] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:32:02] godog: is that you? ^ [09:35:01] (03PS1) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) [09:36:31] (03PS2) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) [09:38:51] marostegui: oops, yes [09:38:55] marostegui: merged [09:38:59] :) [09:39:07] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:03] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:44:27] (03PS3) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) [09:44:54] (03PS3) 10Effie Mouzeli: Remove apache systemd override now that tmpreaper is fixed [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) (owner: 10Muehlenhoff) [09:45:43] !log Disable puppet on all mediawiki servers to test 489982 [09:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:02] (03CR) 10jerkins-bot: [V: 04-1] Remove apache systemd override now that tmpreaper is fixed [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) (owner: 10Muehlenhoff) [09:46:29] (03CR) 10Filippo Giunchedi: "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:47:19] (03PS4) 10Ammarpad: Enable Translate extension on sewikimedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) [09:52:49] (03PS1) 10Effie Mouzeli: mediawiki: Remove apache systemd override for tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) [09:54:10] (03PS1) 10Muehlenhoff: Remove gehel from airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/553053 [09:54:53] (03Abandoned) 10Effie Mouzeli: Remove apache systemd override now that tmpreaper is fixed [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) (owner: 10Muehlenhoff) [09:56:04] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove apache systemd override for tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [09:56:33] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.93 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:59:34] (03PS2) 10Effie Mouzeli: mediawiki: Remove apache systemd override for tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) [10:01:17] (03PS3) 10Effie Mouzeli: mediawiki: Remove apache systemd override for tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) [10:06:51] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 89.98 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:07:46] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [10:07:46] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:07:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:01] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [10:08:01] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [10:08:19] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:33] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [10:08:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:42] I eventually got it right -.- [10:13:35] 10Operations: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10fgiunchedi) >>! In T237438#5690914, @Cmjohnson wrote: > @fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. if you have an issue please assign to me and ad... [10:13:49] 10Operations, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10Reedy) [10:15:27] 10Operations, 10ops-esams: Update spare QFX labels - https://phabricator.wikimedia.org/T237014 (10mark) 05Open→03Resolved a:03mark Done. 
[10:20:33] (03CR) 10Muehlenhoff: mediawiki: Remove apache systemd override for tmpreaper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [10:21:45] (03PS1) 10Filippo Giunchedi: hieradata: add ms-be105[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/553056 (https://phabricator.wikimedia.org/T237438) [10:21:59] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Thanks for working on this!" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [10:22:24] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ms-be105[7-9] [puppet] - 10https://gerrit.wikimedia.org/r/553056 (https://phabricator.wikimedia.org/T237438) (owner: 10Filippo Giunchedi) [10:24:42] (03PS1) 10Muehlenhoff: Add IDP config to backups [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) [10:24:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P9749 and previous config saved to /var/cache/conftool/dbconfig/20191126-102442-marostegui.json [10:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:18] (03CR) 10Arturo Borrero Gonzalez: profile::url_downloader: Add missing labs neutron subnet, also link-local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552631 (owner: 10Alex Monk) [10:26:52] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/552789 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [10:30:26] !log swift eqiad-prod: add ms-be105[7-9] - T237438 [10:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:31] T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 [10:31:19] (03CR) 10Effie 
Mouzeli: mediawiki: Remove apache systemd override for tmpreaper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [10:31:46] 10Operations, 10Patch-For-Review: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10fgiunchedi) a:05RobH→03fgiunchedi [10:34:38] 10Operations, 10Patch-For-Review: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10fgiunchedi) @Cmjohnson @Jclark-ctr I'm not blocked on this (thus no reassigning) but ms-be1059 is in row D judging by its ip address and netbox says row C. I believe n... [10:36:33] (03CR) 10Effie Mouzeli: "pcc looks scary, but I think we should go ahead and test the change" [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [10:38:48] (03CR) 10Alexandros Kosiaris: "Minor nitpicks, but approach looks sane to me" (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/553045 (owner: 10Giuseppe Lavagetto) [10:38:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] envoy-tls: proxy the admin interface too. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/553045 (owner: 10Giuseppe Lavagetto) [10:39:31] (03PS1) 10Mathew.onipe: query_service: use the correct script for autodeployment [puppet] - 10https://gerrit.wikimedia.org/r/553063 [10:41:31] PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:59] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2049 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:43:25] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10fgiunchedi) >>! In T224888#5690188, @CDanis wrote: > I've a proposal for doing this: > > - Add some special tag like `#NRPE` or `#page` to the names of any... [10:44:59] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [10:46:06] (03CR) 10Jbond: [C: 03+2] admin: add jiji to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli) [10:49:07] (03PS2) 10Jbond: admin: add jiji to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli) [10:49:52] (03CR) 10Jcrespo: "You know you can configure a single (or multiple) file as a dataset, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [10:50:04] !log Updated jenkins job operations-puppet-tests-stretch-docker to use latest Docker container [10:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:09] jbond42: I have updated the puppet job ^ [10:50:36] great, thanks hashar [10:50:36] ACKNOWLEDGEMENT - MegaRAID on dbstore1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T239217 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:50:40] 10Operations, 10ops-eqiad: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10ops-monitoring-bot) [10:51:49] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Marostegui) [10:52:25] (03CR) 10Muehlenhoff: "Sure, but see the explanation in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [10:53:31] PROBLEM - WDQS high update lag on wdqs1009 is CRITICAL: 2.902e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:54:37] (03CR) 10Jbond: [C: 03+2] admin: add jiji to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/552818 (owner: 10Effie Mouzeli) [10:54:39] looking [10:56:03] (03PS1) 10Muehlenhoff: Remove unused ganeti-instance-debootstrap [puppet] - 10https://gerrit.wikimedia.org/r/553081 [10:56:28] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [10:58:41] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.28 le 60
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1100). [11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:00:49] o/ [11:04:11] (03PS2) 10Jcrespo: check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 [11:04:13] (03PS1) 10Jcrespo: backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) [11:04:39] (03CR) 10Jcrespo: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:04:56] (03CR) 10jerkins-bot: [V: 04-1] check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 (owner: 10Jcrespo) [11:08:41] (03PS2) 10Jcrespo: backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) [11:08:49] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused ganeti-instance-debootstrap [puppet] - 10https://gerrit.wikimedia.org/r/553081 (owner: 10Muehlenhoff) [11:09:17] (03CR) 10Jcrespo: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:10:07] (03CR) 10Jcrespo: "This is in preparation of the director profile cleanup. Will require documentation update." [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:10:51] (03CR) 10jerkins-bot: [V: 04-1] backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:13:21] (03PS3) 10Jcrespo: backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) [11:13:59] (03CR) 10Jcrespo: [C: 03+1] "Please keep an eye too on https://gerrit.wikimedia.org/r/c/operations/puppet/+/553084 that the merge of the refactoring is done properly." [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [11:15:42] (03CR) 10Jbond: [C: 03+1] "LGTM optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [11:15:53] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:16:45] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:17:58] (03PS4) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 
(https://phabricator.wikimedia.org/T233274) [11:18:07] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:19:47] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:19:47] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:19:49] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:20:05] (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553053 (owner: 10Muehlenhoff) [11:20:09] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:20:15] (03PS2) 10Jbond: Remove gehel from airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/553053 (owner: 10Muehlenhoff) [11:20:27] 
PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [11:20:46] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/19621/backup1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:21:35] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [11:21:35] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:51] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [11:21:51] ut before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:21:55] (03CR) 10Muehlenhoff: [C: 03+1] "Ack, let's give this a shot on one server with Puppet disabled for the rest." 
[puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [11:22:19] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Mon [11:22:19] s [11:22:53] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [11:22:53] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:23:49] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:23:49] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:23:59] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:24:09] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media 
in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:24:09] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:24:35] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:24:37] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [11:25:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [11:25:40] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Remove apache systemd override for tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/553052 (https://phabricator.wikimedia.org/T185195) (owner: 10Effie Mouzeli) [11:25:49] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:26:19] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[11:26:32] hey sigh [11:26:35] great [11:26:44] There might be issues with wikidata master [11:26:53] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:59] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:59] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:04] marostegui: is it related to the mw* alerts?
[11:27:08] probably [11:27:12] it is now inaccessible [11:27:15] I am checking [11:27:17] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:17] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:29] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:27:31] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:35] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:27:35] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:27:41] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:49] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:27:51] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10
seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:07] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:28:19] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:21] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:21] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:21] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:22] it should be fine now [11:28:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:31] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:35] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:37] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:39] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:41] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:41] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:28:49] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:28:49] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:28:51] RECOVERY - Nginx local proxy to apache on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:28:55] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:09] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:11] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:29:13] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:29:14] temporary glitch or something else? 
[11:29:17] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:29:21] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 75449 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:23] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:29:25] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:29:31] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:29:41] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:29:41] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 75449 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:43] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:30:11] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:30:17] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:30:19] PROBLEM - dhclient process on stat1007 is CRITICAL: 
connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [11:30:33] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:30:35] uff stat1007 is separate, taking care of it [11:30:37] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:43] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:30:47] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:31:13] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [11:31:23] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:31:23] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:31:31] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:31:31] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [11:33:53] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [11:34:37] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10MaxSem) Sent. [11:35:28] !enable puppet on mw canary servers, and restart apaches [11:35:37] !log enable puppet on mw canary servers, and restart apaches [11:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:42] !log Deploy schema change on db1139:3311 [11:35:45] PROBLEM - SSH on stat1007 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:06] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10jbond) @Fuzzy are you able to provide me with an email address, please feel free to email me directly (jbond@wikimedia.org) if you would prefer your address not... 
[11:36:15] !log reboot stat1007 [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:41] PROBLEM - Host stat1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:39:17] RECOVERY - SSH on stat1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:39:21] RECOVERY - Host stat1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:39:45] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [11:39:57] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:40:05] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [11:40:05] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:40:29] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:40:33] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:40:35] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [11:40:47] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:40:51] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:05] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:46:46] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10jbond) p:05Triage→03Normal [11:47:25] 10Operations, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10jbond) p:05Triage→03Normal [11:47:52] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (10jbond) p:05Triage→03Normal [11:50:06] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10jbond) @Urbanecm looks like you are i the channel now so resolving this ticket but please reopen if there is still an issue [11:50:23] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10jbond) 05Open→03Resolved p:05Triage→03Normal [11:50:37] (03PS1) 10Effie Mouzeli: install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) [11:51:05] 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10jbond) p:05Triage→03Normal [11:51:25] (03CR) 10jerkins-bot: [V: 04-1] install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [11:51:41] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10jbond) p:05Triage→03High [11:52:17] 10Operations, 10serviceops: 
Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10jbond) p:05Triage→03Normal [11:52:50] (03PS2) 10Effie Mouzeli: install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) [11:52:51] (03PS10) 10Muehlenhoff: Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 [11:53:19] 10Operations, 10DNS, 10Traffic: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (10jbond) p:05Triage→03Normal [11:53:27] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10jbond) p:05Triage→03Normal [11:55:00] (03PS1) 10Alaa Sarhan: Update cron with lb and lb-pool params [puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) [11:57:22] (03PS2) 10Faidon Liambotis: Fix some spelling issues [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552919 [11:58:28] (03CR) 10Faidon Liambotis: "I don't see anything wrong with "spare" in the middle, and regardless, it'd be a pain to change all of these switches across all sites now" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) (owner: 10CRusnov) [12:00:06] (03CR) 10Addshore: Update cron with lb and lb-pool params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) (owner: 10Alaa Sarhan) [12:01:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. 
The current mw-raid1-lvm.cfg recipe allocates 700G for / and 100G for /srv and the new recipe 50G in /root and 80% of th" [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [12:01:44] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::httpd: set a SERVERGROUP env variable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546448 (https://phabricator.wikimedia.org/T235899) (owner: 10Giuseppe Lavagetto) [12:01:50] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::httpd: set a SERVERGROUP env variable [puppet] - 10https://gerrit.wikimedia.org/r/546448 (https://phabricator.wikimedia.org/T235899) [12:04:23] (03PS1) 10Jbond: profile::idp::client::httpd: correct redirect [puppet] - 10https://gerrit.wikimedia.org/r/553098 [12:04:35] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Tue 2019-11-26 12:04:34 UTC. https://wikitech.wikimedia.org/wiki/NTP [12:07:16] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: correct redirect [puppet] - 10https://gerrit.wikimedia.org/r/553098 (owner: 10Jbond) [12:07:29] !log power down mr1-esams for replacement - T238174 [12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:35] T238174: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 [12:07:50] jouncebot: now [12:07:50] No deployments scheduled for the next 3 hour(s) and 52 minute(s) [12:10:16] is it okay if I deploy a beta-only config change? 
[12:10:45] PROBLEM - Host asw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [12:11:30] I don’t really want to do it with the evening SWAT (middle of the night, EU time)… [12:11:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:12:15] that's expected ^ (mr1-esams work) [12:12:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 because of a typo, but +1 on premise and implementation. Feel free to merge after the typo fix." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [12:13:05] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:38] (03PS4) 10Jcrespo: backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) [12:15:13] (03CR) 10Jcrespo: backup: Move filesets to a separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [12:15:19] 10Operations, 10ops-esams: cr3-esams:et-1/0/0 flap - https://phabricator.wikimedia.org/T236767 (10faidon) 05Open→03Resolved a:03faidon Looks good! 
[12:16:00] (03CR) 10Muehlenhoff: [C: 03+2] Setup rsync config for U2F device storage [puppet] - 10https://gerrit.wikimedia.org/r/552821 (owner: 10Muehlenhoff) [12:16:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::httpd: set a SERVERGROUP env variable [puppet] - 10https://gerrit.wikimedia.org/r/546448 (https://phabricator.wikimedia.org/T235899) (owner: 10Giuseppe Lavagetto) [12:17:31] (03PS1) 10Ayounsi: Disable Juniper alarm check for cr3-knams [puppet] - 10https://gerrit.wikimedia.org/r/553099 (https://phabricator.wikimedia.org/T237030) [12:22:51] (03PS5) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [12:25:04] (03CR) 10jerkins-bot: [V: 04-1] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [12:26:27] (03PS6) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [12:27:37] (03CR) 10Jbond: [C: 03+2] puppetboard: add proxied_as parameter [puppet] - 10https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [12:27:46] (03PS4) 10Jbond: puppetboard: add proxied_as parameter [puppet] - 10https://gerrit.wikimedia.org/r/552859 (https://phabricator.wikimedia.org/T238924) [12:29:31] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Daimona) 05Resolved→03Open @jbond I'm sorry, this request is not for Urbanecm but for me (Daimona). 
[12:32:12] (03CR) 10Lucas Werkmeister (WMDE): Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [12:32:28] jouncebot: now [12:32:28] No deployments scheduled for the next 3 hour(s) and 27 minute(s) [12:33:48] I’ll deploy a beta-only config change [12:34:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [12:34:27] (03PS2) 10Lucas Werkmeister (WMDE): Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) [12:35:17] (03CR) 10Lucas Werkmeister (WMDE): Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [12:35:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [12:35:44] (03PS1) 10Jbond: puppetboard cas: add correct port [puppet] - 10https://gerrit.wikimedia.org/r/553100 [12:36:04] (03Merged) 10jenkins-bot: Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [12:37:00] mwdebug1001 seems fine, syncing [12:38:32] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::httpd: use SetEnvIf, not setenv [puppet] - 10https://gerrit.wikimedia.org/r/553101 [12:38:43] !log 
lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:552498|Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp (T238918)]] (duration: 00m 53s) [12:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:48] T238918: Add full item title to bridge RegExp and pass to app - https://phabricator.wikimedia.org/T238918 [12:38:55] (03CR) 10Effie Mouzeli: "> Looks good to me. The current mw-raid1-lvm.cfg recipe allocates" [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [12:39:08] (03CR) 10Jbond: [C: 03+2] puppetboard cas: add correct port [puppet] - 10https://gerrit.wikimedia.org/r/553100 (owner: 10Jbond) [12:39:16] (03PS2) 10Jbond: puppetboard cas: add correct port [puppet] - 10https://gerrit.wikimedia.org/r/553100 [12:40:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19623/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553101 (owner: 10Giuseppe Lavagetto) [12:49:01] (03PS2) 10BBlack: Add and use check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/552935 (https://phabricator.wikimedia.org/T98006) [12:52:25] (03CR) 10Faidon Liambotis: Disable Juniper alarm check for cr3-knams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553099 (https://phabricator.wikimedia.org/T237030) (owner: 10Ayounsi) [12:58:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [12:58:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] allow different memory limit settings for parsoid-php servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [13:03:19] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is 
CRITICAL: 59.71 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:03:46] !log Remove tmpreaper package from all mediawiki servers - T229792 [13:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:51] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [13:04:03] (03CR) 10Ayounsi: [C: 03+2] Disable Juniper alarm check for cr3-knams [puppet] - 10https://gerrit.wikimedia.org/r/553099 (https://phabricator.wikimedia.org/T237030) (owner: 10Ayounsi) [13:05:16] (03PS7) 10Giuseppe Lavagetto: allow different memory limit settings for parsoid-php servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [13:05:52] (03PS1) 10Jbond: profile::idp::client::httpd: update document root variable [puppet] - 10https://gerrit.wikimedia.org/r/553104 [13:06:29] RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:45] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.87 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:06:57] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2049 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:10:22] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: update document root variable [puppet] - 10https://gerrit.wikimedia.org/r/553104 (owner: 10Jbond) [13:11:54] !log enable puppet on mediawiki servers [13:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:03] (03CR) 10Giuseppe 
Lavagetto: [C: 04-1] install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [13:17:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: proxy: enable nginx prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/553105 (https://phabricator.wikimedia.org/T237643) [13:18:12] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10hashar) [13:19:08] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10hashar) About "access to gerrit sql", would it be sufficient to do a database dump from production and load that in a MySQL server local to the test VM? [13:19:25] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10hashar) p:05Triage→03Normal [13:20:01] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10hashar) [13:21:34] PROBLEM - Host cp3064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host cp3061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host cp3062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host cp3063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host lvs3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host ps1-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:21:36] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:21:37] PROBLEM - Host scs-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:21:46] PROBLEM - Host bast3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:21:46] PROBLEM - Host ganeti3003.mgmt 
is DOWN: PING CRITICAL - Packet loss = 100% [13:23:38] RECOVERY - Host cp3064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.13 ms [13:23:38] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:23:42] RECOVERY - Host ps1-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 84.73 ms [13:23:46] RECOVERY - Host bast3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.84 ms [13:23:56] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:24:00] RECOVERY - Host cp3061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms [13:24:22] RECOVERY - Host asw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms [13:25:23] (03PS2) 10Arturo Borrero Gonzalez: toolforge: proxy: enable nginx prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/553105 (https://phabricator.wikimedia.org/T237643) [13:25:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:26:07] icinga sucks :) [13:26:28] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) [13:26:30] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.96 ms [13:26:41] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) All SERVER power cords have been audited in this sheet: https://phabricator.wikimedia.org/T237009 Missing labels on either end of the cord have been augmented. All should be... 
[13:26:56] RECOVERY - Host cp3062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.04 ms [13:26:56] RECOVERY - Host cp3063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.07 ms [13:26:56] RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms [13:26:56] RECOVERY - Host lvs3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.01 ms [13:26:56] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 88.46 ms [13:26:57] RECOVERY - Host scs-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 83.88 ms [13:27:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: proxy: enable nginx prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/553105 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:27:24] 04Critical Alert for device mr1-esams.wikimedia.org - Device rebooted [13:30:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:30:36] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10Gilles) Seems like the varnish re-imaging and repooling of cp3064 helped: https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&var-met... 
[13:30:56] (03CR) 10Mobrovac: [C: 03+1] allow different memory limit settings for parsoid-php servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [13:31:29] (03PS3) 10BBlack: Add and use check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/552935 (https://phabricator.wikimedia.org/T98006) [13:31:31] (03PS1) 10BBlack: authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) [13:31:33] (03PS1) 10BBlack: authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) [13:31:36] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Seems like the varnish re-imaging and repooling of cp3064 helped: https://grafana.wikimedia.org/d/000000230/navigation-ti... [13:31:55] (03PS10) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [13:32:47] (03CR) 10jerkins-bot: [V: 04-1] authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:33:04] anybody checking logstash? 
[13:33:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:33:42] (03CR) 10jerkins-bot: [V: 04-1] authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:34:07] 10Operations, 10ops-esams, 10netops: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10ayounsi) 05Open→03Resolved This is all done. [13:34:19] I'm checking now [13:34:37] 10Operations, 10ops-esams, 10netops: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10ayounsi) 05Resolved→03Open I guess we can keep it open until we return the old one. [13:35:02] godog: I think that there is a page requested causing db timeouts [13:35:57] (03CR) 10Jcrespo: [C: 03+2] backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:36:04] or more [13:36:10] (03PS5) 10Jcrespo: backup: Move filesets to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/553084 (https://phabricator.wikimedia.org/T238048) [13:36:43] indeed WMFTimeoutException from line 39 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded [13:37:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:38:19] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) 
05Open→03Resolved Seems good so far, has been up a few days and in full service for about a day, without incident. Calling this resolved until anything changes! [13:38:24] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [13:39:50] godog: also a lot of elukey@krb1001:~$ sudo manage_principals.py create awight --email_address=adam.wight@wikimedia.de [13:39:53] Principal successfully created. [13:39:55] ufff [13:39:58] that was not it [13:40:05] (sorry awight :P) [13:40:13] ErrorException from line 1591 of /srv/mediawiki/php-1.35.0-wmf.5/includes/GlobalFunctions.php: PHP Notice: Array to string conversion [13:40:22] this was what I wanted to paste [13:40:39] godog: --^ [13:41:01] elukey: indeed [13:42:43] and seems that all the fun is happening for en-wiki [13:43:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:11] yeah so s1 is showing some monitoring latency and lag, maybe related? https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All [13:46:40] jynus marostegui ^ does that graph plus mediawiki timeouts ring a bell ? 
[13:46:53] godog: check -security [13:48:25] (03CR) 10Muehlenhoff: "One comment inline, if the username is the same, it seems fine to drop the previous entry from the YAML user table, after all the primary " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [13:52:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Add system user analytics-privatedata to the anaytics-privatedata-users group - https://phabricator.wikimedia.org/T238306 (10elukey) [13:55:17] (03PS2) 10BBlack: authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) [13:55:19] (03PS2) 10BBlack: authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) [13:56:24] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:56:24] (03CR) 10jerkins-bot: [V: 04-1] authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:56:59] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5693304, @Gilles wrote: > Seems like the varnish re-imaging and repooling of cp3064 helped: Interestingly, th... 
[13:57:17] (03CR) 10jerkins-bot: [V: 04-1] authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:57:56] (03PS4) 10Ema: ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) [13:58:33] (03PS3) 10BBlack: authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) [13:58:35] (03PS3) 10BBlack: authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) [13:59:46] (03CR) 10Jcrespo: [C: 03+1] "You took too much time to merge, so make sure to edit filesets.pp instead of director.pp on rebase." [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [13:59:50] (03CR) 10Ema: [C: 03+2] ATS: explicitly skip the cache instead of hiding CC [puppet] - 10https://gerrit.wikimedia.org/r/552076 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:00:11] (03PS2) 10Ema: ATS: do not coalesce uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494) [14:03:43] (03CR) 10Ema: [C: 03+2] ATS: do not coalesce uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/552862 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:05:44] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: add job for nginx metrics in the front proxy [puppet] - 10https://gerrit.wikimedia.org/r/553113 (https://phabricator.wikimedia.org/T237643) [14:05:57] !log cp3050: depool to merge and test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552862/ T238494 [14:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:02] T238494: 200ms / 50% response start regression 
starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:06:35] (03CR) 10BBlack: [C: 03+2] Add and use check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/552935 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:06:45] (03PS2) 10Arturo Borrero Gonzalez: toolforge: prometheus: add job for nginx metrics in the front proxy [puppet] - 10https://gerrit.wikimedia.org/r/553113 (https://phabricator.wikimedia.org/T237643) [14:09:13] 10Operations, 10DNS, 10Traffic: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (10Bugreporter) >>! In T238825#5691807, @Dzahn wrote: > Really wildcard or more like "populate DNS (langlist.tmpl) with all language codes from [[ https://iso639-3.sil.org/sites/iso63... [14:09:47] (03PS1) 10Muehlenhoff: Fix rsync devices path [puppet] - 10https://gerrit.wikimedia.org/r/553114 [14:09:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: add job for nginx metrics in the front proxy [puppet] - 10https://gerrit.wikimedia.org/r/553113 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [14:11:02] I broke icinga config again, FYI for whomever notices before I fix it [14:14:11] (03CR) 10Muehlenhoff: [C: 03+2] Fix rsync devices path [puppet] - 10https://gerrit.wikimedia.org/r/553114 (owner: 10Muehlenhoff) [14:19:57] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix port for nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/553117 (https://phabricator.wikimedia.org/T237643) [14:21:53] 10Operations, 10DNS, 10Traffic: Create wildcard DNS record for Wikimedia projects - https://phabricator.wikimedia.org/T238825 (10BBlack) I could go either way on the subject of explicit langlist vs wildcard, really, so long as we're confident the MediaWiki layer handles all unknown language codes sanely, inc... 
[14:22:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: fix port for nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/553117 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [14:23:18] (03CR) 10DCausse: [C: 03+1] query_service: use the correct script for autodeployment [puppet] - 10https://gerrit.wikimedia.org/r/553063 (owner: 10Mathew.onipe) [14:23:49] <_joe_> jouncebot: next [14:23:49] In 1 hour(s) and 36 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1600) [14:24:08] <_joe_> mobrovac: I'm merging https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/548944/ [14:24:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] allow different memory limit settings for parsoid-php servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [14:25:02] (03Merged) 10jenkins-bot: allow different memory limit settings for parsoid-php servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [14:26:53] 10Operations, 10ops-esams: apply asset tags to cable managers - https://phabricator.wikimedia.org/T238835 (10mark) All 7 cable managers have been asset tagged and put into Netbox with the appropriate info and rack position. [14:27:16] _joe_: +1 [14:28:00] <_joe_> now testing on mwdebug1001 [14:28:45] poor wikibugs. wasn't me this time! 
[14:28:55] (03PS1) 10BBlack: Move check_dns_query to separate cfg [puppet] - 10https://gerrit.wikimedia.org/r/553119 (https://phabricator.wikimedia.org/T98006) [14:30:48] !log oblivian@deploy1001 Scap failed!: Call to mwscript eval.php stderr: not empty [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:11] <_joe_> sigh [14:31:21] (03CR) 10BBlack: [C: 03+2] Move check_dns_query to separate cfg [puppet] - 10https://gerrit.wikimedia.org/r/553119 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:31:29] <_joe_> that's from not finding SERVERGROUP [14:32:26] <_joe_> ofc from cli there is no such variable [14:33:40] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10jbond) ID.wikipedia, [x] main [x] mobile SU.wikipedia [x] main [x] mobile JV.wikipedia [x] main [x] mobile MIN.wikipedia [x] main [x] mobile AR.wikipedia [x]... [14:35:37] (03PS1) 10Giuseppe Lavagetto: Check existence of the SERVERGROUP env variable before using it. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553120 [14:35:41] <_joe_> mobrovac: ^^ [14:36:16] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) I've filled out all red cells in the (original) bootstrap spreadsheet. [14:36:41] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) [14:37:55] (03CR) 10Mobrovac: [C: 03+1] Check existence of the SERVERGROUP env variable before using it. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/553120 (owner: 10Giuseppe Lavagetto) [14:37:57] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10fgiunchedi) First thank you for getting the ball rolling on this proposal! A question: are all approaches proposed targe... [14:37:59] hehe _joe_ [14:38:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Check existence of the SERVERGROUP env variable before using it. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553120 (owner: 10Giuseppe Lavagetto) [14:39:28] (03Merged) 10jenkins-bot: Check existence of the SERVERGROUP env variable before using it. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553120 (owner: 10Giuseppe Lavagetto) [14:42:04] FYI I just redefined a bunch of DNS healthchecks. If I messed them up and we get lots of DNS alerts soon, don't assume the sky is falling, assume the check is bad :) [14:42:14] !log oblivian@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Raise memory limit on parsoid servers 1/2 (duration: 00m 51s) [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10Ammarpad) [14:44:20] !log oblivian@deploy1001 Synchronized wmf-config/CommonSettings.php: Raise memory limit on parsoid servers 2/2 (duration: 00m 52s) [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:28] (03PS1) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T210098) [14:45:28] (03CR) 10Filippo Giunchedi: install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553095 
(https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [14:45:53] (03PS2) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T210098) [14:46:08] (03PS1) 10Ema: Revert "ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 [14:46:13] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10Joe) >>! In T238909#5693597, @fgiunchedi wrote: > From my POV there's great value in having a single solution for load b... [14:46:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 (owner: 10Ema) [14:46:54] (new DNS checks seem to be working fine!) [14:52:28] (03CR) 10BBlack: [C: 03+2] authdns: configure explicit service/monitor addrs [puppet] - 10https://gerrit.wikimedia.org/r/553108 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:53:36] !log rolling through authdns daemon restarts (necessary to reconfigure ANY-address listener) on authdns1001, authdns2001, ganeti3003 [14:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:27] (03PS2) 10Ema: Revert "ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 (https://phabricator.wikimedia.org/T238494) [14:55:45] !log ignore previous message, restarts not necessary [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:56] who knew, someone thought ahead when they designed this stuff and it Just Worked :P [14:55:59] 10Operations, 10Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10herron) Will this Ganeti cluster use vlan tagged interfaces and associated public/private bridges, 
or will separate physical interfaces connect to both public and private vlans? If tagging, are the switch... [14:56:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:59:05] 10Operations, 10Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10BBlack) I think we'll keep them private-vlan only and no tagging, and for the rare cases of "public" service instances we'll use LVS to route the traffic (same for all the edge-site ganeti). [14:59:24] (03CR) 10Muehlenhoff: [C: 03+1] install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [14:59:28] (03PS1) 10Ema: Revert "ATS: do not coalesce uncacheable requests" [puppet] - 10https://gerrit.wikimedia.org/r/553125 (https://phabricator.wikimedia.org/T238494) [14:59:32] 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Joe) Patches are merged and the memory limit is now at 760 MB, as confirmed by the current OOMs. [15:00:09] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10faidon) a:05faidon→03RobH Approved. [15:00:20] (03PS3) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T210098) [15:00:41] 10Operations, 10ops-esams: Relabel cables with duplicate IDs - https://phabricator.wikimedia.org/T237006 (10mark) 05Open→03Resolved a:03mark All duplicate ids have been fixed, labels replaced for one pair and updated in netbox. 
[15:01:49] (03CR) 10Ema: [C: 03+2] Revert "ATS: do not coalesce uncacheable requests" [puppet] - 10https://gerrit.wikimedia.org/r/553125 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:03:29] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/19626/" [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T210098) (owner: 10Elukey) [15:03:58] (03PS5) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 [15:04:01] (03CR) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [15:04:22] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [15:04:48] (03PS3) 10Ema: Revert "ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 (https://phabricator.wikimedia.org/T238494) [15:05:07] (03PS6) 10Muehlenhoff: Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 [15:05:39] (03PS4) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T210098) [15:05:45] (03PS1) 10Andrew Bogott: Bump version number [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/553127 [15:06:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [15:07:10] (03PS5) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) [15:07:24] (03CR) 10Andrew Bogott: [C: 03+2] Bump version number [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/553127 (owner: 10Andrew Bogott) [15:07:28] (03CR) 10Ema: [C: 03+2] Revert 
"ATS: explicitly skip the cache instead of hiding CC" [puppet] - 10https://gerrit.wikimedia.org/r/553123 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:07:43] (03CR) 10Filippo Giunchedi: install_server: use raid1-gpt-lvm-ext4-srv.cfg recipe for mw* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553095 (https://phabricator.wikimedia.org/T156955) (owner: 10Effie Mouzeli) [15:12:34] (03CR) 10Muehlenhoff: "That's actually a good thing, as I still to update something anyway :-)" [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [15:13:41] 10Operations, 10Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10herron) Ok, for my own edification, how would the private only LVS model work if we wanted to stand up a public facing non HTTP(S) service in a VM at one+ of these sites? [15:14:19] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) [15:15:26] !log cp3050: repool after failed test of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552862/ (reverted) T238494 [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [15:16:50] (03CR) 10Jbond: [C: 03+1] "lgtm optional nit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:17:22] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert on low job availability [puppet] - 10https://gerrit.wikimedia.org/r/552521 (https://phabricator.wikimedia.org/T187708) (owner: 10Filippo Giunchedi) [15:17:32] (03PS2) 10Filippo Giunchedi: prometheus: alert on low job availability [puppet] - 10https://gerrit.wikimedia.org/r/552521 (https://phabricator.wikimedia.org/T187708) [15:22:02] (03CR) 10Elukey: "Thanks John, indeed all users supports only stretch and stat1005 is already on buster. Going to send a separate change to support it!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:24:20] (03PS5) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [15:24:23] (03CR) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:26:01] 10Operations, 10netbox: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10faidon) p:05Triage→03Normal [15:29:22] (03PS6) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) [15:29:26] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) >>! In T224888#5692672, @fgiunchedi wrote: > Sounds great to me! I am assuming on the icinga side it'll be only one alert at least to start with, for e... 
[15:31:43] (03CR) 10Muehlenhoff: [C: 03+2] Setup systemd timer for rsync of U2F devices config [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [15:32:03] (03PS7) 10Elukey: Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) [15:32:40] (03CR) 10CDanis: "just FYI: not sure if your intent was to use the stunnel-wrapping, but as written, this won't" [puppet] - 10https://gerrit.wikimedia.org/r/552857 (owner: 10Muehlenhoff) [15:32:45] jbond42: thanks for the review! I followed your advice for the percentages, and also raised a a bit limits --^ [15:33:11] ack ill recheck [15:33:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:34:01] thanks! [15:36:05] (03CR) 10Elukey: [C: 03+2] Introduce systemd cgroup memory limits for stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/553122 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:37:05] (03PS2) 10Marostegui: db2125: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552971 (https://phabricator.wikimedia.org/T239042) [15:38:15] (03CR) 10Marostegui: [C: 03+2] db2125: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/552971 (https://phabricator.wikimedia.org/T239042) (owner: 10Marostegui) [15:39:06] nope something didn't go well for stat1004 [15:39:48] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10fgiunchedi) >>! In T224888#5693759, @CDanis wrote: > > Any preferences or thoughts re: the special tag? Right now I'm leaning towards `#page` as that seems t... [15:41:15] elukey: did you see dbstore1003's raid? 
[15:41:28] (03PS1) 10Andrew Bogott: wmf_sink: don't commit to instance-puppet if nothing has changed [puppet] - 10https://gerrit.wikimedia.org/r/553130 (https://phabricator.wikimedia.org/T238708) [15:42:52] marostegui: I did yes, I thought that we were waiting for dcops to order a disk no? [15:43:08] elukey: yeah, but not sure if we should assign directly to john or...? [15:43:18] elukey: or comment on the task saying it is fine to proceed? [15:43:43] (03PS1) 10Ema: ATS: disable coalescing for some uncacheable requests [puppet] - 10https://gerrit.wikimedia.org/r/553132 (https://phabricator.wikimedia.org/T238494) [15:43:56] marostegui: sure I'll do it [15:44:03] elukey: <3 [15:49:04] jbond42: I think that systemd needs the specific % char after the value, the docs are weird [15:49:49] otherwise it thinks those are bytes :D [15:50:05] 10Operations, 10netops, 10observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10fgiunchedi) FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188 [15:50:22] elukey: yes i was unsure about if it needed that or not [15:50:36] it says its needed for the cpu stuff but very unclear on the memory [15:50:41] (03PS6) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [15:50:43] (03PS4) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [15:51:05] (03PS1) 10Elukey: profile::analytics::client::limits: set percentage char in user slice [puppet] - 10https://gerrit.wikimedia.org/r/553134 (https://phabricator.wikimedia.org/T212824) [15:51:12] jbond42: fixing it now with a code change, ssh doesn't work due to the cgroup enforcing limits, it works :D [15:51:25] oops :) [15:51:32] let me know if you need a +1 
[15:51:41] (03CR) 10Elukey: [C: 03+2] profile::analytics::client::limits: set percentage char in user slice [puppet] - 10https://gerrit.wikimedia.org/r/553134 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [15:52:18] jbond42: nono my bad, should have looked for examples, the doc was too ambiguous [15:54:21] ok now works :) [15:54:34] yay, :) [15:57:43] (03PS4) 10BBlack: authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) [15:57:45] (03PS1) 10BBlack: authdns: add ferm rules for 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553135 (https://phabricator.wikimedia.org/T98006) [15:58:46] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) cr2-esams now has its power cables labeled: PREM0: 20145 (to ps1) PREM1: 20146 (to ps2) PREM2: 20147 (to ps1) PREM3: 20148 (to ps2) [15:59:13] (03PS1) 10RLazarus: poolcounter: Listen for prometheus on both IPv4 and IPv6. [puppet] - 10https://gerrit.wikimedia.org/r/553137 (https://phabricator.wikimedia.org/T237407) [15:59:44] (03CR) 10jerkins-bot: [V: 04-1] authdns: add ferm rules for 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553135 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:26] !log cp3050: temporarily disable request coalescing to assess performance impact T238494 [16:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:30] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [16:02:56] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:13] looking [16:03:50] PROBLEM - Check the last execution of idp-u2f-sync on idp2001 is CRITICAL: CRITICAL: Status of the systemd unit idp-u2f-sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:05:30] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) Applied the change to stat1004 and executed a little test: ` a = [] while True: print(len(a)) a.append(' ' * 10**6)... 
[16:05:36] jbond42: --^ \o/ [16:06:37] (03PS1) 10Muehlenhoff: Add missing protocol prefix to idp-u2f-sync systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/553139 [16:07:45] cool :) [16:08:01] (03PS2) 10BBlack: authdns: add ferm rules for 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553135 (https://phabricator.wikimedia.org/T98006) [16:08:02] (03PS5) 10BBlack: authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) [16:08:34] (03PS2) 10CRusnov: hieradata/netbox: Add accounting report to alerts [puppet] - 10https://gerrit.wikimedia.org/r/550053 [16:09:12] !log ssastry@deploy1001 Started deploy [parsoid/deploy@ee63341]: Testing rollback fixes (T238685) [16:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:17] T238685: php-fpm isn't restarted when deploys are rolled back - https://phabricator.wikimedia.org/T238685 [16:09:51] (03PS1) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T234826) [16:10:20] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@ee63341]: Testing rollback fixes (T238685) (duration: 01m 07s) [16:10:26] mobrovac, ok, rolled back. [16:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:31] let me check logs. [16:11:06] (03CR) 10Muehlenhoff: [C: 03+2] Add missing protocol prefix to idp-u2f-sync systemd::timer [puppet] - 10https://gerrit.wikimedia.org/r/553139 (owner: 10Muehlenhoff) [16:11:15] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ayounsi) That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones. Note tha... 
[16:11:42] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) cr3-esams now has its power cables labeled: PEM0: 20149 (to ps1 outlet 36) PEM1: 20150 (to ps2 outlet 36) PEM2: 20151 (to ps1 outlet 35) PEM3: 20152 (to ps2 outlet 35) [16:12:24] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10mark) [16:13:55] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) >>! In T224888#5693928, @ayounsi wrote: > That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but ca... [16:14:03] (03PS3) 10Jforrester: Drop HHVMRequestInit, never called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) [16:17:28] (03CR) 10Elukey: "No op for stat1004 (stretch): https://puppet-compiler.wmflabs.org/compiler1002/19629/" [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [16:18:55] (03PS2) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T234826) [16:19:14] !log ssastry@deploy1001 Started deploy [parsoid/deploy@ee63341]: Update Parsoid to 7b9b424a [16:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:19] (03CR) 10Muehlenhoff: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [16:22:19] moritzm: should I explicitly fail for jessie stating that it is not supported ? 
--^ [16:23:16] (03PS1) 10Jforrester: CommonSettings: Drop Scribunto special-case for HHVM, never reached [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553144 (https://phabricator.wikimedia.org/T235142) [16:24:27] went for explicit failure in a "else" case [16:24:27] (03PS3) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) [16:24:43] (03CR) 10Jbond: "thanks for the quick turn around on this :) minor inline comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:25:43] (03CR) 10Jbond: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:26:17] (03CR) 10jerkins-bot: [V: 04-1] systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:26:33] (03CR) 10Jforrester: [C: 03+2] Drop HHVMRequestInit, never called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [16:27:06] (03CR) 10Elukey: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:27:15] (03Merged) 10jenkins-bot: Drop HHVMRequestInit, never called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [16:27:52] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@ee63341]: Update Parsoid to 7b9b424a (duration: 08m 37s) [16:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:56] (03PS7) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 
10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [16:27:58] (03PS5) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [16:30:09] (03PS8) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [16:30:11] (03PS6) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [16:30:40] !log jforrester@deploy1001 Synchronized docroot/noc/conf/: Drop HHVMRequestInit symlink (duration: 00m 52s) [16:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:34] !log No sane way to delete HHVMRequestInit.php with a simple sync-dir, so waiting for the full scap. [16:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:35] (03PS4) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) [16:32:45] !log jforrester@deploy1001 Synchronized docroot/noc/createTxtFileSymlinks.sh: Drop HHVMRequestInit symlink creation (duration: 00m 52s) [16:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:29] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Drop Scribunto special-case for HHVM, never reached [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553144 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [16:34:11] (03Merged) 10jenkins-bot: CommonSettings: Drop Scribunto special-case for HHVM, never reached [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553144 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [16:34:22] (03CR) 10Jbond: [C: 03+1] "lgtm will leave the installed/latest/present decision to 
you :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:36:27] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Drop Scribunto special-case for HHVM, never reached T235142 (duration: 00m 52s) [16:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:34] T235142: Clean up mediawiki-config from HHVM-related configuration - https://phabricator.wikimedia.org/T235142 [16:37:03] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [16:38:33] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:46] tis me [16:39:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Verify SSL certs against the domain in the Host: header. (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [16:39:10] 10Operations, 10ops-esams: apply asset tags to cable managers - https://phabricator.wikimedia.org/T238835 (10mark) 05Open→03Resolved [16:40:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [16:41:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) I think we are on a good track, next steps: 1) add support for Buster (for stat1005) - needs Cloud team's review/approval:... 
[16:41:39] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:00] (03CR) 10BBlack: [C: 03+2] authdns: add ferm rules for 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553135 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [16:42:02] (03CR) 10CDanis: [C: 03+1] poolcounter: Listen for prometheus on both IPv4 and IPv6. [puppet] - 10https://gerrit.wikimedia.org/r/553137 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [16:46:32] (03CR) 10BBlack: [C: 03+2] authdns: move per-server monitors to 5353 [puppet] - 10https://gerrit.wikimedia.org/r/553109 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [16:50:50] (03CR) 10Giuseppe Lavagetto: envoy-tls: proxy the admin interface too. (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/553045 (owner: 10Giuseppe Lavagetto) [16:53:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I think one of the big potential advantages of this approach is making easy to check multiple hosts." [software/httpbb] - 10https://gerrit.wikimedia.org/r/551283 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [16:55:04] (03CR) 10CDanis: envoy-tls: proxy the admin interface too. (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/553045 (owner: 10Giuseppe Lavagetto) [16:57:02] (03PS1) 10Cwhite: hiera: set mtail to run as group adm on lists [puppet] - 10https://gerrit.wikimedia.org/r/553147 (https://phabricator.wikimedia.org/T236505) [16:58:31] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10crusnov) After a conversation with @Volans an extended ask is having the generator able to add and remove files (eg, override completely the c... 
[17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1700). [17:00:15] (03PS3) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) [17:01:40] (03PS2) 10Giuseppe Lavagetto: envoy-tls: proxy /stats from the admin interface. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/553045 [17:03:16] !log cr2-esams: disable interface xe-0/0/2 (transit) [17:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:20] er [17:03:24] PROBLEM - Host scs-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100% [17:03:36] !log above was for cr3-esams [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:46] alert expected? [17:03:50] no, but working on it [17:03:56] ok [17:06:08] RECOVERY - Check the last execution of idp-u2f-sync on idp2001 is OK: OK: Status of the systemd unit idp-u2f-sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:06:24] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:56] RECOVERY - Host scs-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms [17:10:06] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0b74625]: Switch group 0 and 1 to Parsoid-PHP T229015 [17:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:10] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:11:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:13:31] (03PS5) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [17:13:34] (03CR) 10CRusnov: coherence: Check device names for correct case (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [17:14:42] 10Operations, 10ops-codfw: codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [17:14:47] (03CR) 10jerkins-bot: [V: 04-1] admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [17:15:02] 10Operations, 10ops-codfw: codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) p:05Triage→03Normal [17:15:53] 10Operations, 10ops-eqiad: setup/install postgres db system - https://phabricator.wikimedia.org/T239250 (10RobH) [17:16:32] 10Operations, 10ops-eqiad: setup/install postgres db system - https://phabricator.wikimedia.org/T239250 (10RobH) @ssingh: I assigned this to you for feedback on the hostname. It should be reassigned to @Jclark-ctr for implementation when updated. Thanks! [17:17:51] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) The experiment with request coalescing disabled on cp3050 is running. Meanwhile I've noticed something likely interesting abo... 
[17:17:54] (03PS1) 10BBlack: authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 [17:19:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:43] (03CR) 10jerkins-bot: [V: 04-1] authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 (owner: 10BBlack) [17:22:54] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) > Batch editing the DB The update should be: ` UPDATE Media SET StorageId = 11 WHERE StorageId = 4; ` But I need to take a backup and check there is... [17:23:02] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10RobH) p:05Triage→03Normal a:05ssingh→03Jclark-ctr [17:23:28] (03PS2) 10BBlack: authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 [17:24:01] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10RobH) [17:24:52] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10RobH) IRC Update: hostname decision via irc duscussion with @ssingh resulted in 'censorship1001' as hostname. Stole task back and updated, then pushed to John for implementation when the SATA dis... 
[17:25:40] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:25:43] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0b74625]: Switch group 0 and 1 to Parsoid-PHP T229015 (duration: 15m 38s) [17:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:49] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:26:46] !log moving fiberring from cr3-esams:xe-0/0/2 to cr2-esams:xe-0/1/8 [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:28:13] q [17:28:59] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: set mtail to run as group adm on lists [puppet] - 10https://gerrit.wikimedia.org/r/553147 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [17:30:05] (03CR) 10RLazarus: [C: 03+2] poolcounter: Listen for prometheus on both IPv4 and IPv6. [puppet] - 10https://gerrit.wikimedia.org/r/553137 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [17:30:22] (03CR) 10Cwhite: [C: 03+2] hiera: set mtail to run as group adm on lists [puppet] - 10https://gerrit.wikimedia.org/r/553147 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [17:31:48] !log cutting branch for 1.35.0-wmf.8 [17:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10aborrero) In the WMCS team meeting, we decided to rename these servers to better reflect what they do... [17:33:16] (03CR) 10Dzahn: "it's not possible to do this without jenkins bot complaining that "12:14:03 'maxsem' : Absent users are both in "absent" group (first set)" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [17:33:18] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) As per the cloud team meeting. The following hostname change will take place labmon1001 -> cl... [17:35:22] (03PS6) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [17:35:41] 10Operations, 10Traffic, 10Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10dr0ptp4kt) @cchen copying you in for visibility. iPad iOS 13 is a desktop UA, in case that's useful info in o... [17:36:45] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10Papaul) qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17 qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16 qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2 qfx5100-spare2, psu 1 {#20158} to... 
[17:38:16] (03PS3) 10BBlack: authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 [17:39:47] (03CR) 10Dzahn: [C: 03+2] create analytics-web.discovery.wmnet, point to thorium [dns] - 10https://gerrit.wikimedia.org/r/551938 (owner: 10Dzahn) [17:39:51] (03PS3) 10Dzahn: create analytics-web.discovery.wmnet, point to thorium [dns] - 10https://gerrit.wikimedia.org/r/551938 [17:42:42] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (100389 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [17:42:42] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (100389 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [17:44:02] (03PS3) 10Dzahn: rename maintenance.discovery to mwmaint.discovery [dns] - 10https://gerrit.wikimedia.org/r/552944 (https://phabricator.wikimedia.org/T210411) [17:48:48] 10Operations, 10netops: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (10ayounsi) p:05Triage→03Normal [17:49:02] (03PS1) 10Mholloway: MachineVision: Show UploadWizard CTA on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553153 (https://phabricator.wikimedia.org/T234960) [17:49:15] (03PS1) 10Mholloway: MachineVision: Show UploadWizard CTA in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553154 (https://phabricator.wikimedia.org/T234960) [17:51:14] (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [17:52:29] 10Operations, 10ops-eqiad, 10DC-Ops: Asset tag remaining cablemgmt in eqiad - https://phabricator.wikimedia.org/T239110 (10Cmjohnson) 05Open→03Resolved @faidon All should have asset tags now and netbox updated. 
[17:52:39] (03CR) 10Mholloway: [C: 03+2] MachineVision: Show UploadWizard CTA in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553154 (https://phabricator.wikimedia.org/T234960) (owner: 10Mholloway) [17:53:21] (03Merged) 10jenkins-bot: MachineVision: Show UploadWizard CTA in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553154 (https://phabricator.wikimedia.org/T234960) (owner: 10Mholloway) [17:53:24] (03PS4) 10BBlack: authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 [17:54:20] (03PS1) 10Giuseppe Lavagetto: Update blubberoid to workaround in telemetry collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/553155 [17:54:55] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10kaldari) I also belatedly endorse this, as Max's manager's manager :) [17:55:00] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10mark) scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34 scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34 asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26 asw2-oe16-esams:psu1 {#20164} to ps1... [17:56:29] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: MachineVision: Show UploadWizard CTA in beta (T234960) (duration: 00m 52s) [17:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:34] T234960: Add call to action on final step of Upload Wizard - https://phabricator.wikimedia.org/T234960 [17:57:17] (03CR) 10Dzahn: [C: 03+2] "I added this and it's not used yet. Changing name to "mwmaint" instead to match host names and certificate." 
[dns] - 10https://gerrit.wikimedia.org/r/552944 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1800) [18:03:15] (03CR) 10CDanis: [C: 03+1] "I am not exactly a competent reviewer, but seems reasonable and LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/553155 (owner: 10Giuseppe Lavagetto) [18:04:38] (03CR) 10Jhedden: [C: 04-1] nova: add nova config for the placement service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:05:03] (03CR) 10Jhedden: [C: 04-1] nova: add support for the 'placement' api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:05:42] (03PS1) 10Giuseppe Lavagetto: run_ci_locally.sh: use the latest image version. [puppet] - 10https://gerrit.wikimedia.org/r/553156 [18:08:13] (03CR) 10Andrew Bogott: nova: add support for the 'placement' api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:08:38] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Cmjohnson) a:03Jclark-ctr I created a self-dispatch ticket. You have successfully submitted request SR1004377941. Assigning to @Jclark-ctr since I will be out of the area. [18:09:40] !log issues with branch.py branch cut; deleted stub wmf/1.35.0-wmf.8 branch and proceeding with standard process [18:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:06] (03CR) 10Dzahn: [C: 04-1] "blocked on first adding the name to the certificate / create a new certificate." [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [18:11:26] (03CR) 10Dzahn: [C: 04-1] "or not.. 
because how does it work with https://thorium.eqiad.wmnet now if that isn't on the cert" [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [18:12:18] (03CR) 10Jhedden: [C: 04-1] nova: add support for the 'placement' api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:13:06] jouncebot: next [18:13:06] In 0 hour(s) and 46 minute(s): Mediawiki train - American Version (strange Tuesday-only edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1900) [18:14:49] (03CR) 10Jhedden: [C: 03+1] "I'd prefer using the native service ports on the load balancer, but it's mostly cosmetic and technically doesn't block anything." [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:16:14] (03CR) 10Andrew Bogott: "> I'd prefer using the native service ports on the load balancer" [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:16:56] (03PS9) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [18:16:58] (03PS7) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [18:17:41] (03PS8) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [18:19:50] (03CR) 10Muehlenhoff: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [18:20:19] elukey: either that or use requires_os() [18:20:47] (03PS10) 10Andrew Bogott: nova: add support for the 'placement' api service 
[puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [18:20:49] (03PS9) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [18:21:43] (03CR) 10Subramanya Sastry: [C: 03+1] Parsoid: Switch groups 0 and 1 to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552972 (https://phabricator.wikimedia.org/T229015) (owner: 10Mobrovac) [18:26:56] 10Operations, 10ops-esams, 10netops: mr1-esams RMA - https://phabricator.wikimedia.org/T238174 (10RobH) IRC Update: * @mark is going to take home defective chassis for later reutn. * https://support.juniper.net/support/rma-locations/ lists the address, but the support case doesn't have a return label I ha... [18:29:04] (03PS5) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) [18:29:50] (03CR) 10Elukey: "Moritz: Jessie should lead to a clean fail() now, and I replaced lastest with preset. 
Let me know if anything else is missing :)" [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [18:33:07] !log stop logstash on logstash200[5-6] for metrics collection - T215904 [18:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:13] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904 [18:34:40] (03PS6) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [18:36:01] (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [18:36:49] (03CR) 10Dzahn: "While we agree that the jenkins -1 is a bug in CI, just overriding it also doesn't seem to be a good idea because puppet-compiler says" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [18:37:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:52] (03PS7) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [18:41:25] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/19635/" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [18:44:16] (03CR) 10Dzahn: [C: 03+1] "planning to merge like this once Rachel confirms NDA" [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [18:44:50] !log temporarily update pipeline.batch.size to 1000 on 
logstash2004 - T215904 [18:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:55] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904 [18:50:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] run_ci_locally.sh: use the latest image version. [puppet] - 10https://gerrit.wikimedia.org/r/553156 (owner: 10Giuseppe Lavagetto) [18:50:04] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [18:52:49] (03PS1) 10Brennen Bearnes: Group0 to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553159 [18:55:46] !log stop logstash codfw, generate some consumer lag - T215904 [18:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904 [18:57:11] (03PS11) 10Andrew Bogott: nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) [18:59:18] (03CR) 10Andrew Bogott: [C: 03+2] nova: add support for the 'placement' api service [puppet] - 10https://gerrit.wikimedia.org/r/552890 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [19:00:04] brennen and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version (strange Tuesday-only edition) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T1900). [19:00:39] welcome back releng ^ [19:00:47] here [19:00:51] thanks mutante [19:01:07] thanks mutante. 
:) [19:01:49] (03PS6) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [19:02:10] (03PS10) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [19:02:12] (03PS1) 10Andrew Bogott: nova-placement service: Fix type of 'active' param [puppet] - 10https://gerrit.wikimedia.org/r/553161 (https://phabricator.wikimedia.org/T239161) [19:03:17] !log ebernhardson@deploy1001 Started deploy [search/airflow@d9779a9]: redeploy current version [19:03:20] !log ebernhardson@deploy1001 Finished deploy [search/airflow@d9779a9]: redeploy current version (duration: 00m 02s) [19:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:39] !log ebernhardson@deploy1001 Started deploy [search/airflow@d9779a9]: redeploy current version [19:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:42] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement service: Fix type of 'active' param [puppet] - 10https://gerrit.wikimedia.org/r/553161 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [19:03:44] !log ebernhardson@deploy1001 Finished deploy [search/airflow@d9779a9]: redeploy current version (duration: 00m 05s) [19:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] !log brennen@deploy1001 Pruned MediaWiki: 1.35.0-wmf.2 (duration: 07m 08s) [19:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:46] (03PS11) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [19:05:48] (03PS1) 10Andrew Bogott: nova-placement-api: fix 
service name [puppet] - 10https://gerrit.wikimedia.org/r/553164 (https://phabricator.wikimedia.org/T239161) [19:06:44] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement-api: fix service name [puppet] - 10https://gerrit.wikimedia.org/r/553164 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [19:06:46] !log brennen@deploy1001 Started scap: testwiki to php-1.35.0-wmf.8 and rebuild l10n cache [19:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:05] twentyafterfour: slightly behind here, just syncing testwiki now. [19:07:27] !log ebernhardson@deploy1001 Started deploy [search/airflow@6ab2cd1]: Align deploy groups in scap.cfg and checks.yaml [19:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:56] !log ebernhardson@deploy1001 Finished deploy [search/airflow@6ab2cd1]: Align deploy groups in scap.cfg and checks.yaml (duration: 00m 29s) [19:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:57] !log stop logstash codfw, generate some consumer lag, and set batch size to 2000 - T215904 [19:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:02] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904 [19:10:21] (03CR) 10CRusnov: [C: 03+2] coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [19:10:27] (03PS4) 10CRusnov: coherence: Check device names for correct case [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/552874 (https://phabricator.wikimedia.org/T237469) [19:14:00] (03PS1) 10RLazarus: prometheus: Scrape the poolcounter exporters. 
[puppet] - 10https://gerrit.wikimedia.org/r/553168 (https://phabricator.wikimedia.org/T237407) [19:14:40] (03PS12) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [19:19:17] (03PS1) 10CRusnov: Revert "coherence: Check device names for correct case" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/553169 [19:20:41] (03PS2) 10CRusnov: Revert "coherence: Check device names for correct case" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/553169 [19:22:49] !log restore codfw logstash to baseline - T215904 [19:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:54] T215904: Better understanding of Logstash performance - https://phabricator.wikimedia.org/T215904 [19:23:14] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201911), 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10Jdforrester-WMF) 05Open→03Resolved [19:24:07] (03CR) 10CDanis: [C: 03+1] "PCC is a little funny, but I think that's PCC being PCC, rather than an actual error?" [puppet] - 10https://gerrit.wikimedia.org/r/553168 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [19:24:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:25:22] (03CR) 10Herron: "updated PCC looks good ("input_kafka_consumer_group_id": "logstash7-eqiad" is in the change catalog) https://puppet-compiler.wmflabs.org/c" [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:25:35] Hey anybody - just wanted a sanity-check here that merges to master for annual.w.o should get deployed every 30 minutes or so w/ puppet, correct? 
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/microsites/annualreport.pp [19:26:03] (03CR) 10Herron: "> updated PCC looks good ("input_kafka_consumer_group_id":" [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:26:16] sbassett: that is my read as well [19:26:25] cdanis: great, thanks. [19:27:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P9753 and previous config saved to /var/cache/conftool/dbconfig/20191126-192724-marostegui.json [19:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:54] (03PS7) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [19:39:38] !log brennen@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.8 and rebuild l10n cache (duration: 32m 52s) [19:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553159 (owner: 10Brennen Bearnes) [19:42:05] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553159 (owner: 10Brennen Bearnes) [19:46:37] (03Abandoned) 10CRusnov: Revert "coherence: Check device names for correct case" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/553169 (owner: 10CRusnov) [19:46:54] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.8 [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:11] (03PS5) 10BBlack: authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 [19:57:53] !log Reset email of TheklanBot (T239233) [19:57:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:58] T239233: Reset password for bot TheklanBot - https://phabricator.wikimedia.org/T239233 [19:59:01] !log create partitioned topics for cirrusSearchElasticaWrite on kafka-main T239135 [19:59:03] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/19636/mwmaint1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [19:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:06] T239135: Create partitioned CirrusSearchElasticaWrite topic - https://phabricator.wikimedia.org/T239135 [19:59:15] (03PS5) 10Dzahn: mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) [20:05:13] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@2b713d6]: Partition CirrusSearchElasticaWrite jobs T230495 [20:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:18] T230495: Partition CirrusSearch mediawiki jobs by cluster - https://phabricator.wikimedia.org/T230495 [20:06:36] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@2b713d6]: Partition CirrusSearchElasticaWrite jobs T230495 (duration: 01m 23s) [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] (03CR) 10BBlack: [C: 03+2] authdns: port 53 monitor made optional [puppet] - 10https://gerrit.wikimedia.org/r/553150 (owner: 10BBlack) [20:18:37] (03PS1) 10BBlack: dns::recursor: move role bits down to profile [puppet] - 10https://gerrit.wikimedia.org/r/553176 (https://phabricator.wikimedia.org/T98006) [20:18:39] (03PS1) 10BBlack: dns roles/profiles refactor, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/553177 (https://phabricator.wikimedia.org/T98006) [20:19:33] (03PS1) 10Jhedden: openstack: add health check options support to haproxy 
[puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) [20:21:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:23:03] brennen: Prod clear for me to do a fun deploy? [20:23:10] (03PS6) 10Dzahn: mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) [20:23:12] (03PS7) 10Jforrester: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) [20:23:28] (03PS2) 10Jhedden: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) [20:24:03] James_F: looking good, i'd say go ahead. [20:24:09] !log ebernhardson@deploy1001 Started deploy [search/airflow@c235ab5]: Rebuild environment for python 3.7.3 [20:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:26] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@c282e86]: Followup on T230495 [20:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:31] Cool, thanks! 
[20:24:31] T230495: Partition CirrusSearch mediawiki jobs by cluster - https://phabricator.wikimedia.org/T230495 [20:24:51] !log ebernhardson@deploy1001 Finished deploy [search/airflow@c235ab5]: Rebuild environment for python 3.7.3 (duration: 00m 42s) [20:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:01] (03CR) 10Dzahn: [C: 03+2] mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [20:25:11] (03PS3) 10Jhedden: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) [20:25:26] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@c282e86]: Followup on T230495 (duration: 00m 59s) [20:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:20] bblack: Your last merge is still on master. Merge multiple changes? yes/no ? [20:26:57] (03PS4) 10Andrew Bogott: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:26:59] (03PS13) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [20:27:21] mutante: sorry, yes, go ahead [20:27:48] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:28:27] bblack: ack, no problem. 
merged [20:28:41] (03Merged) 10jenkins-bot: Variant configuration: Generate dblists from YAML [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545411 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:28:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:29:38] (03PS5) 10Andrew Bogott: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:29:40] (03PS14) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [20:29:50] (03PS4) 10Herron: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) (owner: 10Muehlenhoff) [20:29:53] (03CR) 10Herron: Switch Ganeti servers in esams/ulsfo to Buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) (owner: 10Muehlenhoff) [20:31:06] (03PS6) 10Andrew Bogott: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:31:40] (03CR) 10jerkins-bot: [V: 04-1] openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:31:51] (03PS2) 10BBlack: dns::recursor: move role bits down to profile [puppet] - 10https://gerrit.wikimedia.org/r/553176 (https://phabricator.wikimedia.org/T98006) [20:31:53] (03PS2) 10BBlack: dns roles/profiles refactor, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/553177 (https://phabricator.wikimedia.org/T98006) [20:31:55] (03PS1) 10BBlack: Draft 
fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 [20:32:49] (03PS7) 10Jhedden: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) [20:32:57] (03PS8) 10Andrew Bogott: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:32:59] (03PS15) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [20:33:13] !log jforrester@deploy1001 Synchronized dblists/: Update dblists, now autogenerated (no-op, just comment changes) T223602 (duration: 01m 01s) [20:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:20] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [20:35:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:36:24] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10wiki_willy) a:03Jclark-ctr [20:37:07] (03PS9) 10Andrew Bogott: openstack: add health check options support to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:37:09] (03PS16) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [20:38:33] (03PS2) 10BBlack: Draft fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 [20:39:20] (03CR) 10Andrew Bogott: [C: 03+2] openstack: add health check options support to haproxy [puppet] - 
10https://gerrit.wikimedia.org/r/553178 (https://phabricator.wikimedia.org/T239161) (owner: 10Jhedden) [20:42:53] (03PS14) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) [20:43:49] (03PS3) 10BBlack: Draft fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 [20:45:23] (03PS1) 10EBernhardson: Always set AIRFLOW_HOME when running airflow [puppet] - 10https://gerrit.wikimedia.org/r/553183 [20:46:03] (03CR) 10BBlack: [C: 03+2] dns::recursor: move role bits down to profile [puppet] - 10https://gerrit.wikimedia.org/r/553176 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [20:46:22] (03CR) 10BBlack: [C: 03+2] dns roles/profiles refactor, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/553177 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [20:51:50] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10wiki_willy) a:03Papaul Looks like the warranty expired on March 19, 2018...and we're coming up on the 5yr mark. The hardware refresh is scheduled for Q3, so let's confirm with the service owner that we can... [20:58:05] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Marostegui) @jbond maybe it is a good idea to disable puppet on all databases before merging the change and then trying a manual run on a single host to see how it... 
[20:59:44] (03PS23) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [20:59:46] (03PS6) 10Jforrester: Variant configuration: Move some all-wiki configuration from CS to all.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539436 [21:02:31] (03PS1) 10BBlack: Update cumin aliases for dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/553186 (https://phabricator.wikimedia.org/T98006) [21:03:00] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:03:31] (03CR) 10BBlack: [C: 03+2] Draft fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 (owner: 10BBlack) [21:03:42] (03PS4) 10BBlack: Draft fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 [21:03:51] (03CR) 10BBlack: [V: 03+2 C: 03+2] Draft fully-blended role [puppet] - 10https://gerrit.wikimedia.org/r/553181 (owner: 10BBlack) [21:04:05] uh [21:04:23] thanks confusing gerrit UI for jumping to a different change in the series somehow as I was getting annoyed at a different change? :P [21:04:38] hmmm [21:04:42] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:04:54] (03CR) 10BBlack: [C: 03+2] Update cumin aliases for dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/553186 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [21:05:04] anyways, I shall carry on! [21:05:37] (03CR) 10Dzahn: [C: 04-1] ""running" isn't a valid value for the $ensure parameter of systemd::service directly. see inline comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [21:09:43] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10DannyS712) I don't know if this is indeed done: `Request from [snip] via cp4028.ulsfo.wmnet, ATS/8.0.5 Error: 502, Cannot find server. at 2019-11-26 21:08:03 GMT` w... [21:13:21] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10BBlack) I think you ran into a temporary blip in some unrelated DNS work (which is already dealt with), not this bug (502 errors can happen for real infra failure r... [21:17:37] (03CR) 10RLazarus: [C: 03+2] prometheus: Scrape the poolcounter exporters. [puppet] - 10https://gerrit.wikimedia.org/r/553168 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:17:45] (03PS1) 10BBlack: Add dns4002 to authdns set [puppet] - 10https://gerrit.wikimedia.org/r/553189 [21:17:48] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content) - https://phabricator.wikimedia.org/T237319 (10CDanis) [21:18:48] (03CR) 10BBlack: [C: 03+2] Add dns4002 to authdns set [puppet] - 10https://gerrit.wikimedia.org/r/553189 (owner: 10BBlack) [21:19:26] (03PS2) 10RLazarus: prometheus: Scrape the poolcounter exporters. 
[puppet] - 10https://gerrit.wikimedia.org/r/553168 (https://phabricator.wikimedia.org/T237407) [21:19:34] (03PS17) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [21:19:36] (03PS1) 10Andrew Bogott: nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) [21:21:16] (03PS15) 10EBernhardson: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) [21:21:40] (03CR) 10jerkins-bot: [V: 04-1] nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [21:24:01] (03PS1) 10BBlack: test commit [dns] - 10https://gerrit.wikimedia.org/r/553194 [21:24:35] (03CR) 10BBlack: [C: 03+2] test commit [dns] - 10https://gerrit.wikimedia.org/r/553194 (owner: 10BBlack) [21:25:06] (03PS2) 10Andrew Bogott: nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) [21:25:08] (03PS18) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [21:26:58] (03CR) 10jerkins-bot: [V: 04-1] nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [21:28:15] (03PS3) 10Andrew Bogott: nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) [21:28:17] (03PS19) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) 
[21:29:05] (03CR) 10Andrew Bogott: "compiler diff looks right to me: https://puppet-compiler.wmflabs.org/compiler1001/19649/cloudcontrol2001-dev.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [21:29:38] (03CR) 10EBernhardson: "This still shouldn't be merged until Iccc58193361443d2c91f74ba6fb257fc246473d3 is merged, so that `airflow initdb` can be run. Until `air" [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [21:30:15] PROBLEM - Host cp3063 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=poolcounter_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:34:48] (03PS1) 10Dzahn: ATS: use TLS to noc.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) [21:40:39] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:41:48] (03PS1) 10Jforrester: wmf-beta-update-databases: ignore comments on all-labs.dblist [puppet] - 10https://gerrit.wikimedia.org/r/553200 [21:42:04] Anyone around to push out ^^? 
[21:42:14] poolcounter_exporter alert is mine -- new alerting config, not a prod issue -- looking [21:42:17] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:42:24] (03CR) 10jerkins-bot: [V: 04-1] wmf-beta-update-databases: ignore comments on all-labs.dblist [puppet] - 10https://gerrit.wikimedia.org/r/553200 (owner: 10Jforrester) [21:42:31] new *monitoring config rather [21:43:49] (03PS4) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [21:44:24] (03PS2) 10Jforrester: wmf-beta-update-databases: ignore comments on all-labs.dblist [puppet] - 10https://gerrit.wikimedia.org/r/553200 [21:46:12] mutante: Don't suppose I can get you to merge it? ;-) [21:46:43] (03PS5) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [21:49:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:50:49] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:51:15] (03CR) 10Jforrester: RejectParserCacheValue to reject possibly-corrupted entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [21:54:37] (03CR) 10Dzahn: [C: 03+2] wmf-beta-update-databases: ignore comments on all-labs.dblist [puppet] - 10https://gerrit.wikimedia.org/r/553200 (owner: 10Jforrester) [21:56:04] (03PS1) 10RLazarus: prometheus: Specify metrics_path for the new 
poolcounter exporter, whoops. [puppet] - 10https://gerrit.wikimedia.org/r/553203 (https://phabricator.wikimedia.org/T237407) [21:56:22] (03CR) 10CDanis: [C: 03+1] prometheus: Specify metrics_path for the new poolcounter exporter, whoops. [puppet] - 10https://gerrit.wikimedia.org/r/553203 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [21:59:40] (03CR) 10RLazarus: [C: 03+2] prometheus: Specify metrics_path for the new poolcounter exporter, whoops. [puppet] - 10https://gerrit.wikimedia.org/r/553203 (https://phabricator.wikimedia.org/T237407) (owner: 10RLazarus) [22:04:30] (03PS1) 10BBlack: authdns: define gdnsd user/group explicitly [puppet] - 10https://gerrit.wikimedia.org/r/553205 [22:05:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10JAllemandou) @elukey: We should apply the same treatment for stat1007 :) [22:06:57] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) [22:07:36] (03CR) 10BBlack: [C: 03+2] authdns: define gdnsd user/group explicitly [puppet] - 10https://gerrit.wikimedia.org/r/553205 (owner: 10BBlack) [22:08:01] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10RStallman-legalteam) Max's NDA is fully signed and on file w/ legal. [22:09:04] Thank you! [22:13:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:37] (03CR) 10Jhedden: [C: 03+1] nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:18:10] (03CR) 10Jhedden: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:21:26] (03PS4) 10Andrew Bogott: nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) [22:23:57] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement-api: Make service port configurable [puppet] - 10https://gerrit.wikimedia.org/r/553191 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:25:51] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: don't commit to instance-puppet if nothing has changed [puppet] - 10https://gerrit.wikimedia.org/r/553130 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott) [22:33:14] (03PS1) 10Catrope: WelcomeSurvey: Enable for 100% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553206 (https://phabricator.wikimedia.org/T238874) [22:38:40] (03PS20) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [22:38:42] (03PS1) 10Andrew Bogott: nova-placement-api: update firewall and monitoring for new port [puppet] - 10https://gerrit.wikimedia.org/r/553207 (https://phabricator.wikimedia.org/T239161) [22:40:04] (03CR) 10Jhedden: [C: 03+1] nova-placement-api: update firewall and monitoring for new port [puppet] - 10https://gerrit.wikimedia.org/r/553207 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:40:50] (03CR) 10Andrew Bogott: [C: 
03+2] nova-placement-api: update firewall and monitoring for new port [puppet] - 10https://gerrit.wikimedia.org/r/553207 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:40:53] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) Status update: the blended authdns+recdns(+ntp) role is now nearly-complete in `role::dnsbox`. There's a hieradata flag `profile::dnsbox::include_auth`... [22:48:56] (03PS21) 10Andrew Bogott: nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [22:48:58] (03PS1) 10Andrew Bogott: nova-placement-api: fix permissions on init script [puppet] - 10https://gerrit.wikimedia.org/r/553208 (https://phabricator.wikimedia.org/T239161) [22:50:01] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement-api: fix permissions on init script [puppet] - 10https://gerrit.wikimedia.org/r/553208 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [22:59:59] (03PS24) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [23:00:02] (03PS1) 10Jforrester: Make it possible to load site config from InitialiseSettings.json as well as .php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553209 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191126T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:03:36] (03PS2) 10RLazarus: Verify SSL certs against the domain in the Host: header. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) [23:03:39] (03PS2) 10RLazarus: Refactor the state shared between test cases into a TestHarness class. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551283 (https://phabricator.wikimedia.org/T236699) [23:04:33] (03CR) 10RLazarus: [C: 03+2] Verify SSL certs against the domain in the Host: header. (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [23:05:22] (03CR) 10RLazarus: [C: 03+2] "Yeah, agreed! Starting to look at building that out next." [software/httpbb] - 10https://gerrit.wikimedia.org/r/551283 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [23:05:56] (03Merged) 10jenkins-bot: Verify SSL certs against the domain in the Host: header. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551250 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [23:06:50] (03Merged) 10jenkins-bot: Refactor the state shared between test cases into a TestHarness class. [software/httpbb] - 10https://gerrit.wikimedia.org/r/551283 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [23:13:27] (03CR) 10Dzahn: "now possible after https://gerrit.wikimedia.org/r/c/operations/puppet/+/539633" [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:15:40] (03CR) 10Dzahn: [C: 03+2] admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) (owner: 10Dzahn) [23:15:48] (03PS8) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [23:19:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) Thanks Rachel. 
@maxsem I added you to the "nda" group in LDAP and also back into the "WMF-NDA" group in Phabricator. [23:21:28] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) 05Open→03Resolved I believe this is all that is to do here to check "beta cluster", "security issues" (you are still in the ACL g... [23:27:27] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) >>! In T184066#5694430, @mark wrote: > scs1-oe16-esams:psu1 {#20163} to ps2-oe16-esams:34 > scs1-oe16-esams:psu2 {#20164} to ps1-oe16-esams:34 > > asw2-oe16-esams:psu0 {#20162} to ps2-oe16-... [23:41:43] RECOVERY - WDQS high update lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1161 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:43:53] (03CR) 10Catrope: [C: 03+2] WelcomeSurvey: Enable for 100% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553206 (https://phabricator.wikimedia.org/T238874) (owner: 10Catrope) [23:44:46] (03Merged) 10jenkins-bot: WelcomeSurvey: Enable for 100% of new users on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553206 (https://phabricator.wikimedia.org/T238874) (owner: 10Catrope) [23:55:34] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WelcomeSurvey for 100% of new users on arwiki (duration: 01m 02s) [23:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log