[00:45:57] <wikibugs>	 Operations, netops: csw2-esams's VCP link flapped - https://phabricator.wikimedia.org/T229755 (ayounsi) Open→Declined > I finished working on them but I was not able to match the digital trace to any software report like bug or PR. > When there is a core-dump alongside to an event that caused an...
[01:11:55] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=302 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[01:14:17] <icinga-wm>	 PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:15:43] <icinga-wm>	 RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 77327 bytes in 1.435 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:59:59] <wikibugs>	 Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Slaporte) @tramm I wanted to confirm that we got your email and we're looking into it. Chuck is out of office for the next few days, following his work at Wiki...
[02:00:24] <wikibugs>	 Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Slaporte) a:CRoslof→Slaporte
[02:24:02] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Handle non-integer status_code in json response
[02:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:11] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Handle non-integer status_code in json response (duration: 04m 09s)
[02:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 42910168 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 18848 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:52:45] <wikibugs>	 (CR) Vgutierrez: [C: +2] ATS: Disable config status check for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[03:52:55] <wikibugs>	 (PS5) Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594)
[03:57:11] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4021 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:58:11] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp5001 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:58:34] <vgutierrez>	 getting rid of checks is always noisy /o\
[03:58:55] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2022 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:01] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2017 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:05] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1090 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:17] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4026 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:17] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4022 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:19] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2005 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:19] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2011 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:19] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2018 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:21] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2024 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:31] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1084 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:31] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1082 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:37] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1086 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:37] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3044 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:37] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3034 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:41] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1076 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:41] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4025 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:43] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2002 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:43] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2008 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:45] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp5002 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:45] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp5006 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:45] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp5004 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:45] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp5005 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:49] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2025 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:51] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2014 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:55] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3045 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:59:57] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4023 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:01] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3039 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:01] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3035 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:05] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1078 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:07] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp2026 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:09] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3047 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:11] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1088 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:11] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp1080 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:17] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp4024 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:18] <wikibugs>	 (PS10) CRusnov: backends: add Netbox backend [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900)
[04:00:19] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3036 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:19] <icinga-wm>	 PROBLEM - check_trafficserver_tls_config_status on cp3046 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:01:34] <vgutierrez>	 sorry about that.. that check has been removed
[04:02:15] <wikibugs>	 (CR) Vgutierrez: [C: +2] ATS: Enable TCP Fast Open for the TLS instance [puppet] - https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[04:02:26] <wikibugs>	 (PS4) Vgutierrez: ATS: Enable TCP Fast Open for the TLS instance [puppet] - https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594)
[04:07:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] backends: add Netbox backend [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[04:15:11] <wikibugs>	 (PS1) CRusnov: netbox: Make host private and add exception on not found [software/spicerack] - https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072)
[04:23:08] <wikibugs>	 (PS1) Vgutierrez: Release 8.0.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/531332
[04:34:59] <wikibugs>	 (CR) Vgutierrez: [C: +2] prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[04:35:09] <wikibugs>	 (PS9) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[04:41:17] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 490 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[04:52:27] <wikibugs>	 Operations, ops-eqiad, DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (Marostegui) Open→Resolved Thank you Chris! This looks good now ` root@db1063:~# megacli -LDPDInfo -aAll  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[04:53:36] <wikibugs>	 (PS1) Vgutierrez: prometheus: Consider the new layer label for ATS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594)
[04:58:03] <wikibugs>	 (PS1) Marostegui: db1122: Change binlog format to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/531335 (https://phabricator.wikimedia.org/T230785)
[04:58:44] <wikibugs>	 (CR) Marostegui: [C: +2] db1122: Change binlog format to STATEMENT [puppet] - https://gerrit.wikimedia.org/r/531335 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[05:01:03] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785)
[05:02:08] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[05:03:01] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[05:03:16] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[05:04:31] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1122 status: candidate master for s2 - T230785 (duration: 00m 55s)
[05:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:40] <stashbot>	 T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785
[05:05:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 for binlog format change', diff saved to https://phabricator.wikimedia.org/P8949 and previous config saved to /var/cache/conftool/dbconfig/20190821-050501-marostegui.json
[05:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:46] <marostegui>	 !log Restart MySQL on db1122 for binlog format change - T230785
[05:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1122 after restart', diff saved to https://phabricator.wikimedia.org/P8950 and previous config saved to /var/cache/conftool/dbconfig/20190821-051441-marostegui.json
[05:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1122', diff saved to https://phabricator.wikimedia.org/P8951 and previous config saved to /var/cache/conftool/dbconfig/20190821-052613-marostegui.json
[05:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:34:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:45:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1122', diff saved to https://phabricator.wikimedia.org/P8952 and previous config saved to /var/cache/conftool/dbconfig/20190821-054542-marostegui.json
[05:45:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:33] <wikibugs>	 (CR) ArielGlenn: [C: +1] dumps::web::htmldumps: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:48:31] <wikibugs>	 (CR) ArielGlenn: [C: +1] dumps::generation::server: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:49:05] <wikibugs>	 (CR) ArielGlenn: [C: +1] dumps: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:12:13] <wikibugs>	 Operations, Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (Joe) p:Triage→Normal
[06:24:38] <wikibugs>	 (PS2) Muehlenhoff: Setup partman config for puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531190
[06:28:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Setup partman config for puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531190 (owner: Muehlenhoff)
[06:29:27] <elukey>	 I just checked the OSPF alarms; they are related to the Zayo circuit between cr2-codfw and cr2-eqiad. There is maintenance scheduled
[06:29:43] <elukey>	 so everything seems ok :)
[06:39:59] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:43:47] <icinga-wm>	 RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-21 03:47:27 from db2100.codfw.wmnet:3317 (850 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:44:08] <wikibugs>	 (CR) Muehlenhoff: backup::ofsite: add ipv6 mapped address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:56:12] <wikibugs>	 (CR) Muehlenhoff: "Let's split this and first take the mwdebug servers and canaries." [puppet] - https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[07:00:55] <wikibugs>	 (CR) Muehlenhoff: "I don't see a patch for redis/eqiad lined up or maybe just not added with reviewers yet?" [puppet] - https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[07:09:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[07:11:58] <wikibugs>	 (CR) Mathew.onipe: [C: +1] elasticsearch::relforge: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[07:24:51] <wikibugs>	 Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Joe)
[07:30:46] <wikibugs>	 Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (MoritzMuehlenhoff) We maintain custom 7.2 packages anyway (based on the 7.2.x releases), we can cherrypick the patch for our package upd...
[07:41:03] <wikibugs>	 (CR) Vgutierrez: [C: +1] ATS: enable compress.so for upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:44:24] <wikibugs>	 (PS2) Ema: ATS: enable compress.so for upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432)
[07:45:08] <wikibugs>	 (CR) Ema: [C: +2] ATS: enable compress.so for upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:47:41] <wikibugs>	 (CR) Ema: [C: +1] Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) (owner: Elukey)
[07:53:49] <wikibugs>	 Operations, DBA: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (Marostegui)
[07:53:52] <wikibugs>	 Operations, DBA: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (Marostegui)
[07:53:57] <wikibugs>	 Operations, DBA: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui)
[07:54:12] <wikibugs>	 Operations, DBA: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (Marostegui) p:Triage→Normal
[07:54:23] <wikibugs>	 Operations, DBA: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (Marostegui) p:Triage→Normal
[07:54:29] <wikibugs>	 Operations, DBA: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui) p:Triage→Normal
[07:55:16] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:56:45] <ema>	 !log upload@eqsin: rolling ats-backend-restart to enable compress plugin
[07:56:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1122', diff saved to https://phabricator.wikimedia.org/P8953 and previous config saved to /var/cache/conftool/dbconfig/20190821-075813-marostegui.json
[07:58:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:42] <wikibugs>	 Operations, DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (Marostegui) Binary log format changed on db1122, host upgraded and rebooted.
[08:00:17] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883)
[08:02:14] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: Marostegui)
[08:02:43] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2052 [puppet] - https://gerrit.wikimedia.org/r/531380 (https://phabricator.wikimedia.org/T230883)
[08:02:57] <wikibugs>	 (CR) Marostegui: [C: +1] "Let's go for these two hosts after the test ones" [puppet] - https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:03:12] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: Marostegui)
[08:03:30] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: Marostegui)
[08:04:03] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2052 [puppet] - https://gerrit.wikimedia.org/r/531380 (https://phabricator.wikimedia.org/T230883) (owner: Marostegui)
[08:04:37] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2052 from config T230883 (duration: 00m 54s)
[08:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:46] <stashbot>	 T230883: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883
[08:05:39] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2052 from config T230883 (duration: 00m 54s)
[08:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:29] <wikibugs>	 Operations, DBA, Patch-For-Review: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (Marostegui)
[08:11:28] <marostegui>	 !log Remove db2052 from tendril and zarcillo T230883
[08:11:34] <marostegui>	 !log Stop MySQL on db2052 T230883
[08:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:36] <stashbot>	 T230883: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883
[08:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:33] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (Marostegui) a:Marostegui→RobH
[08:12:54] <wikibugs>	 (PS1) Hashar: Remove role::ci::slave::webperformance [puppet] - https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416)
[08:12:56] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (Marostegui) This host is ready for #dc-ops to decommission
[08:13:09] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:16:28] <wikibugs>	 Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (MoritzMuehlenhoff) Did the technician replace the mainboard?
[08:18:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:18:56] <moritzm>	 !log installing puppetdb2002
[08:19:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:44] <wikibugs>	 (PS2) Elukey: Add base configuration for an-conf100[1-3] [puppet] - https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025)
[08:21:05] <wikibugs>	 (PS2) Ema: varnishlog: request/response headers to send to logstash [puppet] - https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333)
[08:21:28] <wikibugs>	 (PS1) Tarrow: Termbox Staging Test - Allow numeric characters in language codes [deployment-charts] - https://gerrit.wikimedia.org/r/531426
[08:21:30] <wikibugs>	 (PS1) Tarrow: Termbox Staging - Allow numeric characters in language codes [deployment-charts] - https://gerrit.wikimedia.org/r/531427
[08:21:32] <wikibugs>	 (PS1) Tarrow: Termbox codfw - Allow numeric characters in language codes [deployment-charts] - https://gerrit.wikimedia.org/r/531428
[08:21:34] <wikibugs>	 (PS1) Tarrow: Termbox eqiad - Allow numeric characters in language codes [deployment-charts] - https://gerrit.wikimedia.org/r/531429
[08:25:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "This first needs a sudo rule that will allow mwdeploy to restart php-fpm" [puppet] - https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[08:25:46] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:25:54] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:26:13] <wikibugs>	 (CR) Hashar: "I cant tell what are the impact of adding ipv6 to thumbor sorry :\" [puppet] - https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:27:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530014/ needs to be merged before we proceed further." [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[08:27:23] <wikibugs>	 (CR) Elukey: [C: +2] Add base configuration for an-conf100[1-3] [puppet] - https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025) (owner: Elukey)
[08:28:31] <elukey>	 new zookeeper nodes for analytics --^
[08:28:34] <elukey>	 \o/
[08:29:04] <elukey>	 hopefully we'll move analytics-related znodes away from conf* soon
[08:29:07] <moritzm>	 !log upgrading PHP on contint*
[08:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] New library to interact with poolcounter from python [software/python-poolcounter] - https://gerrit.wikimedia.org/r/517828 (owner: Giuseppe Lavagetto)
[08:32:34] <wikibugs>	 (PS3) Ema: varnishlog: request/response headers to send to logstash [puppet] - https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333)
[08:34:27] <wikibugs>	 (CR) Ema: [V: +2 C: +2] varnishlog: request/response headers to send to logstash [puppet] - https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) (owner: Ema)
[08:36:48] <wikibugs>	 (PS4) Gehel: wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[08:37:39] <wikibugs>	 (PS1) Tarrow: Enable Termbox on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896)
[08:38:50] <wikibugs>	 (PS2) Tarrow: Enable Termbox on wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896)
[08:38:58] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add debian package build [software/python-poolcounter] - https://gerrit.wikimedia.org/r/517979 (owner: Giuseppe Lavagetto)
[08:39:39] <wikibugs>	 (CR) Gehel: [C: +2] wdqs: restrict port 8888 to analytics networks [puppet] - https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[08:40:35] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:41:03] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10fgiunchedi) p:05Unbreak!→03Normal Downgrading to normal since the situation has stabilized and returned to normal as 23:40 UTC, stil...
[08:41:10] <wikibugs>	 (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow)
[08:48:29] <wikibugs>	 (03CR) 10Jakob: [C: 03+2] "This change is ready for review." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/531426 (owner: 10Tarrow)
[08:48:43] <wikibugs>	 (03PS1) 10Elukey: Add AAAA/A/PTR records for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025)
[08:51:30] <wikibugs>	 (03PS2) 10Elukey: Add AAAA/A/PTR records for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025)
[08:51:33] <elukey>	 anybody up for a quick DNS review?
[08:51:36] <elukey>	 --^
[08:51:42] <elukey>	 (new hosts)
[08:51:48] <moritzm>	 looking
[08:52:26] <elukey>	 thanksss
[08:53:59] <wikibugs>	 (03CR) 10Jakob: [V: 03+2 C: 03+2] Termbox Staging Test - Allow numeric characters in language codes [deployment-charts] - 10https://gerrit.wikimedia.org/r/531426 (owner: 10Tarrow)
[08:56:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[08:56:29] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10fgiunchedi) Looks like a logstash consumer has failed, according to kafka logs on logstash1010  ` [2019-08-20 23:09:36,042] INFO [GroupC...
[08:57:40] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' .
[08:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add AAAA/A/PTR records for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[08:59:38] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[09:00:04] <jouncebot>	 tarrow and jakob_wmde: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Initial deployment of the new mobile termbox on Wikidata . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T0900).
[09:07:23] <_joe_>	 !log uploaded python-poolcounter to stretch,buster
[09:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:45] <wikibugs>	 (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/531427 (owner: 10Tarrow)
[09:09:29] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' .
[09:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:43] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:09:49] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:09:49] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "lgtm, one small nit (optional)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis)
[09:14:57] <wikibugs>	 (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/531428 (owner: 10Tarrow)
[09:15:25] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' .
[09:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:15] <wikibugs_>	 (03PS18) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[09:18:28] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[09:19:13] <wikibugs_>	 (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/531429 (owner: 10Tarrow)
[09:20:39] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' .
[09:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "These are passive, so let's deploy there and check" [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:24:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:24:34] <tarrow>	 Right, we're enabling termbox on wikidatawiki now
[09:24:38] <wikibugs>	 (03PS2) 10Jbond: dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099)
[09:25:04] <marostegui>	 tarrow: what's the expected impact of that?
[09:25:24] <wikibugs>	 (03PS19) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[09:25:28] <tarrow>	 some actual load on the termbox service
[09:25:45] <tarrow>	 a little (really a drop in the ocean) more load on the api appservers
[09:26:01] <wikibugs>	 (03CR) 10Jakob: [C: 03+2] Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow)
[09:26:02] <marostegui>	 tarrow: gotcha, thanks :-)
[09:26:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dumps::generation::server: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:26:20] <wikibugs>	 (03CR) 10Mathew.onipe: Add maps reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[09:26:24] <wikibugs>	 (03PS2) 10Jbond: dumps::generation::server: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099)
[09:27:20] <tarrow>	 and there might be a small increase in the PCache size (again we expect this to be small)
[09:27:26] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow)
[09:28:22] <wikibugs>	 (03CR) 10jenkins-bot: Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow)
[09:28:33] <marostegui>	 tarrow: thank you :)
[09:28:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:29:06] <wikibugs>	 (03PS2) 10Jbond: dumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099)
[09:29:15] <moritzm>	 !log rebooting db2102 (reverting to a proper stretch 4.9 kernel, it used a bpo kernel due to some hardware debugging a while back)
[09:29:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:05] <marostegui>	 moritzm: did you downtime it?
[09:30:20] <moritzm>	 yeah and stopped mariadb.service
[09:30:23] <marostegui>	 \o/
[09:30:46] <moritzm>	 for mariadb::core_test roles I also need to manually start mariadb.service when it's back, right?
[09:30:52] <marostegui>	 yeah
[09:30:56] <moritzm>	 ack
[09:30:56] <marostegui>	 I can do that once it is back if you want
[09:30:59] <marostegui>	 and start replication
[09:31:07] <marostegui>	 moritzm: just let me know when it is back up
[09:31:08] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+1] restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:31:14] <moritzm>	 ack, I'll ping you when it's booted to the correct kernel
[09:31:18] <marostegui>	 thanks
[09:32:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debmonitor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:32:16] <wikibugs>	 (03PS2) 10Jbond: debmonitor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099)
[09:34:25] <moritzm>	 marostegui: it's up and I've purged the old backports kernel, so further reboots won't need manual selection of the 4.9 kernel in Grub
[09:34:37] <marostegui>	 great! I will take care of mysql
[09:34:38] <marostegui>	 thanks
[09:34:42] <wikibugs>	 (03PS2) 10Jbond: backup::offsite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099)
[09:34:57] <wikibugs>	 (03CR) 10Jbond: backup::offsite: add ipv6 mapped address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:36:25] <logmsgbot>	 !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:531433|Enable Termbox on wikidatawiki (T230896)]] (duration: 00m 55s)
[09:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:33] <stashbot>	 T230896: Enable Termbox on wikidatawiki - https://phabricator.wikimedia.org/T230896
[09:38:34] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:39:39] <wikibugs>	 (03PS2) 10Jbond: puppetboard/puppetdb: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099)
[09:40:55] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: hiera: fix the hierarchical order of lookups [puppet] - 10https://gerrit.wikimedia.org/r/475500 (owner: 10Giuseppe Lavagetto)
[09:41:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard/puppetdb: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[09:46:59] <tarrow>	 !log finished enabling termbox on wikidatawiki
[09:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:52] <wikibugs>	 (03PS2) 10Jbond: mariadb::parsercache - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099)
[09:54:45] <jijiki>	 jouncebot: next
[09:54:45] <jouncebot>	 In 1 hour(s) and 5 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1100)
[09:56:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::parsercache - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:00:43] <wikibugs>	 (03PS3) 10Jbond: mariadb::core_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099)
[10:01:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::core_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:02:04] <moritzm>	 !log installing puppetdb1002
[10:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "These hosts are passive, so let's deploy there" [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:06:48] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:07:17] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:08:49] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:09:28] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[10:10:30] <godog>	 yeah definitely we need some tweaks on the latency alerts, a bit too touchy now even if true
[10:11:12] <wikibugs>	 (03PS2) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099)
[10:11:14] <wikibugs>	 (03PS1) 10Jbond: MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099)
[10:12:23] <wikibugs>	 (03PS3) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099)
[10:12:31] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:12:35] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:12:57] <jijiki>	 godog: +1
[10:14:29] <wikibugs>	 (03PS2) 10Jbond: mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099)
[10:15:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:15:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:23:22] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454
[10:25:08] <wikibugs>	 (03PS3) 10Jbond: mariadb::proxy - codfw:  add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099)
[10:25:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::proxy - codfw:  add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:27:16] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update s8-master record [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762)
[10:27:27] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) (owner: 10Marostegui)
[10:28:02] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762)
[10:28:12] <wikibugs>	 (03PS4) 10Marostegui: mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657)
[10:28:19] <wikibugs>	 (03PS3) 10Marostegui: wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657)
[10:28:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[10:34:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:34:34] <wikibugs>	 (03PS4) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099)
[10:35:50] <jbond42>	 heads up im deploying the ipv6 mapped change to the mw canary and debug hosts.  i dont expect any impact https://gerrit.wikimedia.org/r/c/operations/puppet/+/531256
[10:35:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[10:40:15] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:42:16] <wikibugs>	 (03PS2) 10Jbond: restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099)
[10:42:17] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:45:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:45:48] <Urbanecm>	 !log Run mwscript namespaceDupes.php --wiki=zhwikisource --add-prefix=FIXME --fix (T230548)
[10:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:57] <stashbot>	 T230548: Shortcut namespace redirect on zhwikisource - https://phabricator.wikimedia.org/T230548
[10:47:34] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:49:35] <Urbanecm>	 !log Wrapped code added to CommonSettings.php in T230601 to wgExtensionFunctions
[10:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:43] <stashbot>	 T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601
[10:49:54] <Urbanecm>	 !log Previous log entry was for mwdebug1002
[10:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff)
[10:52:43] <Urbanecm>	 !log Move 0a87e3c's code to abusefilter.php on mwdebug1002 (T230601)
[10:52:48] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[10:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:29] <wikibugs>	 (03PS2) 10Jbond: elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099)
[10:54:05] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Configuration, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084 (10Marostegui) @Joe can this be considered done already with `dbctl`?
[10:55:16] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (10Joe)
[10:55:19] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Configuration, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084 (10Joe) 05Open→03Resolved a:03Joe Indeed! we're doing more than this!
[10:57:05] <Urbanecm>	 !log Run scap pull on mwdebug1002 (T230601)
[10:57:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:13] <stashbot>	 T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601
[10:57:48] <wikibugs>	 10Operations, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (10Marostegui)
[10:57:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1100).
[11:00:04] <jouncebot>	 alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:05:49] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "A minor thing and a nit, see inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[11:06:11] <alaa_wmde>	 hello there :) I've got a config change, any deployer up for it?
[11:11:25] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17967/" [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff)
[11:11:37] <wikibugs>	 (03PS2) 10Muehlenhoff: puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454
[11:12:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:13:11] <Amir1>	 alaa_wmde: let me do it
[11:13:47] <alaa_wmde>	 thanks @Amir1 
[11:15:20] <wikibugs>	 (03PS1) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025)
[11:15:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:15:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff)
[11:16:01] <Urbanecm>	 Amir1: thanks!
[11:16:09] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan)
[11:16:23] <Urbanecm>	 Wow, so many reverts
[11:16:44] <wikibugs>	 (03PS2) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025)
[11:17:01] <Amir1>	 Urbanecm: it's actually more, Alaa forked it
[11:17:18] <Amir1>	 I have one small issue, I can't move between panes in tmux
[11:17:19] <wikibugs>	 (03PS3) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025)
[11:17:39] <Amir1>	 my up arrow key is broken...
[11:19:15] <Amir1>	 I need a couple of minutes to remap my keyboard
[11:19:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:20:14] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:20:56] <Urbanecm>	 Amir1: I bought an external keyboard for that purpose
[11:21:08] <Urbanecm>	 (it doesn't work on my normal keyboard)
[11:22:22] <alaa_wmde>	 > Wow, so many reverts
[11:22:22] <alaa_wmde>	 yeap it is the switch that flips the table upside-down on terms store... and clients weren't fully tested unfortunately leading to some bugs being discovered only in production
[11:22:53] <Urbanecm>	 Got it :)
[11:24:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::misc::tendril: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:24:08] <wikibugs>	 (03PS2) 10Jbond: mariadb::misc::tendril: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099)
[11:24:46] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:27:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] elasticsearch::cirrus - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:27:21] <wikibugs>	 (03PS2) 10Jbond: elasticsearch::cirrus - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099)
[11:27:27] <Amir1>	 made some progress will be back soon
[11:27:54] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:28:28] <wikibugs>	 (03PS1) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:28:58] <elukey>	 moritzm: --^
[11:29:36] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluste
[11:29:36] <icinga-wm>	 ethod=GET
[11:29:45] <moritzm>	 looking
[11:31:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline, but feel free to ignore." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:32:18] <wikibugs>	 (03CR) 10Elukey: Use standard partman recipe for an-conf100[1-3] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:33:26] <wikibugs>	 (03PS2) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:33:55] <wikibugs>	 (03PS3) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:34:24] <icinga-wm>	 PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:28] <icinga-wm>	 PROBLEM - Check systemd state on ores1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:30] <icinga-wm>	 PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:50] <jijiki>	 ^ checking 
[11:34:55] <elukey>	 celery-ores-worker.service
[11:34:56] <jijiki>	 unless someone knows something 
[11:35:02] <jijiki>	 tx 
[11:35:15] <elukey>	 nono please go
[11:35:22] <elukey>	 I am checking as well
[11:35:58] <icinga-wm>	 RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:58] <wikibugs>	 (03PS2) 10Jbond: elasticsearch::cirrus - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099)
[11:36:21] <jijiki>	 I think it restarted itself 
[11:36:35] <Amir1>	 Urbanecm: can you do it? my up key doesn't work and I couldn't get it to work
[11:36:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] elasticsearch::cirrus - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:36:50] <elukey>	 not on 1002 (I am on it)
[11:37:21] <Urbanecm>	 Amir1: well I'm on mobile, I'm sorry
[11:37:38] <icinga-wm>	 RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:49] <elukey>	 !log restart celery-ores-worker on ores1002 
[11:37:51] <jijiki>	 elukey: should we just start them ?
[11:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:13] <elukey>	 I think so yes, can't find why they died in the logs though
[11:38:15] <Amir1>	 hmm, I can buy a keyboard fairly quickly but we need to extend the swat a bit
[11:38:29] <jijiki>	 !log Restarting ores on ores1004 and ores1005
[11:38:30] <Urbanecm>	 I can probably deploy it, but i'm not sure i would be also able to revert it should it be needed 
[11:38:33] <alaa_wmde>	 lol no no we can do it tomorrow too
[11:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:39] <jbond42>	 elukey: jijiki i have started 1005
[11:38:41] <alaa_wmde>	 no need to stress and rush into buying a keyboard now
[11:38:53] <Amir1>	 alaa_wmde: are you sure? I need the keyboard anyway :D
[11:39:00] <Urbanecm>	 In that case, /me votes for rescheduling :D
[11:39:04] <jijiki>	 jbond42: oh I had no idea
[11:39:08] <icinga-wm>	 RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:20] <jijiki>	 maybe we all overreacted here :p
[11:39:51] <jbond42>	 sorry i had only just come to the party and elu.key said to restart and i was there
[11:40:08] <alaa_wmde>	 Amir1: yeah we can do that tomorrow .. maybe I do my first deployment while you're watching over my shoulders ;)
[11:40:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:41:33] <elukey>	 well celery down out of the blue for ores is not really great :D
[11:41:40] <elukey>	 better to restart than leaving it broken
[11:41:59] <Amir1>	 !log EU SWAT is done
[11:42:03] <Amir1>	 alaa_wmde: sure!
[11:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:16] <Urbanecm>	 alaa_wmde: you have deployment privs?
[11:42:27] <wikibugs>	 (03PS4) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:42:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: remove ldap[0].providerClass [puppet] - 10https://gerrit.wikimedia.org/r/530409 (owner: 10Jbond)
[11:42:34] <wikibugs>	 (03PS2) 10Jbond: idp: remove ldap[0].providerClass [puppet] - 10https://gerrit.wikimedia.org/r/530409
[11:42:56] <alaa_wmde>	 Urbanecm: I must have them by now yes
[11:44:01] <wikibugs>	 (03PS2) 10Jbond: mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099)
[11:44:10] <wikibugs>	 (03PS5) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:44:29] <Urbanecm>	 alaa_wmde: so you can do it yourself next time, good to know :d
[11:44:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:44:50] <elukey>	 so now we should figure out why celery went down
[11:46:02] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:46:09] <wikibugs>	 (03PS6) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025)
[11:46:11] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey)
[11:49:16] <alaa_wmde>	 Urbanecm: yeap once I do my first deployment (or first few with someone by my side for quick questions) then you'll probably see me lurking here a lot more often :)
[11:49:58] <Urbanecm>	 Good!
[11:50:07] <Urbanecm>	 Feel free to ask questions if you have any
[11:50:44] <elukey>	 Amir1: o/ - do you have any idea why celery on some ores100X nodes could shut down?
[11:52:04] <jbond42>	 elukey: i see this in the logs https://phabricator.wikimedia.org/P8955 on at least 1005 and 1002
[11:52:52] <elukey>	 jbond42: I checked the code and it doesn't seem to abruptly cause a shutdown, and the rest seems more or less what celery emits when shutting down afaics..
[11:52:55] <elukey>	 really strange
[11:53:03] <elukey>	 plus are we missing alarms for celery?
[11:54:12] <jbond42>	 yes i dont really know anything about celery im afraid
[11:54:22] <elukey>	 me too :(
[11:56:10] <awight>	 One edge case we've documented is that any connectivity glitch between Celery and ORES's Redis cluster will cause the celery service to go zombie.
[11:56:31] <elukey>	 awight: o/
[11:56:38] <elukey>	 go zombie means shutting down ?
[11:56:44] <awight>	 There's a watchdog which restarts every 15 minutes or something.
[11:57:14] <awight>	 elukey: I thought I remembered them continuing to run but never accepting further jobs?  But shutting down is also a failure we saw, maybe caused by something else.
[11:57:29] <awight>	 Would be good to review the memory usage graphs, IMO.
[11:58:11] <elukey>	 all right opening a task :)
[11:58:33] <wikibugs>	 (03PS2) 10Jbond: mw servers - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531255 (https://phabricator.wikimedia.org/T102099)
[11:59:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mw servers - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531255 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[11:59:48] <jbond42>	 !log add ipv6 mapped address to mw codfw servers
[11:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1200)
[12:01:07] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - 10https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[12:01:15] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki::users: Allow adding privileges via profiles [puppet] - 10https://gerrit.wikimedia.org/r/531474
[12:01:18] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[12:02:48] <awight>	 Amir1: fwiw, ^-A <tab> will switch panes I believe
[12:04:46] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Log dnsblacklist entries at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm)
[12:06:50] <awight>	 This doesn't look healthy, https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?refresh=1m&panelId=25&fullscreen&orgId=1&from=now-7d&to=now-1m
[12:06:59] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:07:06] <elukey>	 there you go --^
[12:07:47] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[12:12:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "You should append "_layer" to the resulting metric, to indicate layer is there too now. Also consider that this will create new metrics an" [puppet] - 10https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[12:12:50] <wikibugs>	 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) @Cmjohnson I was able to install the OS on an-conf1001 via manual PXE install, but I had to set in the BIOS the following serial console setting: `S...
[12:22:45] <awight>	 elukey: Thank you!
[12:23:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Prometheus hosts already have ipv6 afaics" [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:23:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:24:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:24:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:28:18] <wikibugs>	 (03CR) 10Muehlenhoff: "Already present in profile::lvs" [puppet] - 10https://gerrit.wikimedia.org/r/531244 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:29:00] <wikibugs>	 (03CR) 10Muehlenhoff: "Already present in role::installserver" [puppet] - 10https://gerrit.wikimedia.org/r/531237 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[12:38:02] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474
[12:44:58] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Something is very wrong" [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli)
[12:45:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Something  is very  wrong" [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli)
[12:46:42] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[12:50:38] <wikibugs>	 10Operations, 10serviceops: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff)
[12:52:36] <wikibugs>	 10Operations, 10serviceops: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) I'm running into a build failure, which I initially assumed was caused by DNS resolution in pbuilder/boron, but it's ultimately caused by MariaDB; the build calls mysql_install_db from...
[13:00:04] <jouncebot>	 zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1300).
[13:00:46] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474
[13:00:58] <zeljkof>	 thank you jouncebot for the reminder
[13:01:00] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[13:01:06] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:01:39] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884)
[13:02:40] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:05:20] <wikibugs>	 (03PS4) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[13:07:06] <wikibugs>	 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10tstarling) Cherry pick is not exactly the right word, I'm just proposing a temporary hack so that it will maybe work, whereas PHP 7.3 do...
[13:08:03] <wikibugs>	 (03PS1) 10Zfilipin: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481
[13:08:05] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin)
[13:09:19] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin)
[13:09:54] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin)
[13:11:02] <logmsgbot>	 !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.19
[13:11:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:11:37] <jijiki>	 ^ downtime expired  
[13:11:42] <jijiki>	 that is mine 
[13:11:58] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.19 (duration: 00m 55s)
[13:12:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:16] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:16:42] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,400} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method
[13:17:21] <wikibugs>	 (03PS2) 10Elukey: Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046
[13:19:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17972/" [puppet] - 10https://gerrit.wikimedia.org/r/531046 (owner: 10Elukey)
[13:19:54] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:20:07] <wikibugs>	 (03Abandoned) 10Jbond: prometheus: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:20:26] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:20:57] <wikibugs>	 (03Abandoned) 10Jbond: lvs::balancer: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531244 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:21:59] <wikibugs>	 (03Abandoned) 10Jbond: installserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531237 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:22:27] <wikibugs>	 (03PS4) 10CDanis: deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291
[13:22:31] <wikibugs>	 (03PS2) 10Jbond: logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099)
[13:23:33] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] deployment: add fix-staging-perms command & sudo for it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis)
[13:24:04] <wikibugs>	 (03PS1) 10Tchanders: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484
[13:24:28] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,400} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method
[13:25:24] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:26:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:26:52] <wikibugs>	 (03CR) 10Dmaza: [C: 03+1] Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders)
[13:27:31] <wikibugs>	 (03PS5) 10CDanis: deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291
[13:27:36] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:27:43] <wikibugs>	 (03CR) 10CDanis: [V: 03+2 C: 03+2] deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis)
[13:28:39] <wikibugs>	 (03PS3) 10Elukey: Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046
[13:28:42] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046 (owner: 10Elukey)
[13:29:57] <wikibugs>	 (03PS2) 10Jbond: swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099)
[13:31:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:33:18] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders)
[13:34:11] <wikibugs>	 (03PS1) 10Nuria: Removing loading of Reading_Depth into druid [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042)
[13:34:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:35:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders)
[13:37:12] <wikibugs>	 (03PS2) 10Nuria: Removing loading of Reading_Depth into druid [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042)
[13:37:24] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:37:58] <wikibugs>	 (03PS2) 10Jbond: swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099)
[13:39:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:40:46] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:43:32] <wikibugs>	 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) Ack, let me know when you have found a suitable value for  GC_ROOT_BUFFER_MAX_ENTRIES, I have the 7.2.21 update for s...
[13:43:50] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:44:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:44:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[13:47:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:48:34] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:50:29] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "@Smalyshev: do you have a list of URLs that can be used to test this change?" [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev)
[13:51:48] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[13:51:57] <wikibugs>	 (03CR) 10jenkins-bot: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders)
[13:53:14] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[13:53:35] <Amir1>	 I think I haven't rebased my patch in deploy1001. Has anyone deployed anything since the SWAT?
[13:54:14] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) It looks to me like all of this log output is actually from celery starting back up.    I wo...
[13:55:25] <halfak>	 o/ elukey 
[13:55:41] <halfak>	 How did you notice that celery was down on those nodes?
[13:56:42] <elukey>	 halfak: o/ I opened a task about it, we have an alarm on all nodes running systemd that alerts when units are failed
[13:56:45] <elukey>	 (one or more)
[13:57:03] <halfak>	 Gotcha.  Thank you for looking into it.  I'm writing a task now about putting monitoring in place. 
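The alert elukey describes (any systemd unit in a failed state flips the host check to CRITICAL) can be sketched roughly as follows. This is only an illustration, not the actual check_systemd_state plugin: the function names are hypothetical, and the inputs are assumed to be the text printed by `systemctl is-system-running` and `systemctl list-units --state=failed --plain --no-legend`. Exit codes follow the usual Nagios/Icinga convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
def parse_failed_units(listing: str) -> list[str]:
    """Return unit names from `systemctl list-units --state=failed --plain --no-legend` text."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            # The first column is the unit name, e.g. celery-ores-worker.service
            units.append(fields[0])
    return units


def check_state(system_state: str, failed_units: list[str]) -> tuple[int, str]:
    """Map the `systemctl is-system-running` text to a (exit_code, message) pair."""
    state = system_state.strip()
    if state == "running":
        return 0, "OK - running"
    if state == "degraded":
        # "degraded" means the system is up but one or more units failed.
        return 2, "CRITICAL - degraded: " + ", ".join(failed_units)
    return 3, f"UNKNOWN - {state}"


if __name__ == "__main__":
    # Hypothetical sample input resembling the ORES incident above.
    sample = "celery-ores-worker.service loaded failed failed ORES celery worker\n"
    print(check_state("degraded", parse_failed_units(sample)))
```

A check like this only notices a unit that systemd itself marks as failed; a worker that keeps running but stops accepting jobs would need a separate, service-level probe.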
[14:00:04] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:01:02] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) >>! In T230917#5428548, @Halfak wrote: > It looks to me like all of this log output is actua...
[14:01:50] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey)
[14:03:27] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) On ores1002, I see the following in app.log:  ` 2019-08-21 11:31:10,673 ERROR celery.worker....
[14:05:22] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) I see the same error on ores1006.  But celery is clearly still running there.
[14:05:55] <halfak>	 elukey, just to confirm, you did not restart celery on any nodes other than ores100[2,4,5] right?
[14:06:26] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:06:37] <elukey>	 halfak: correct
[14:07:20] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[14:07:27] <halfak>	 Aha!  It looks like some of the machines eventually hit a "timeout" error talking to redis -- which is recoverable.
[14:07:39] <elukey>	 ah nice!
[14:07:42] <elukey>	 where is the app.log?
[14:07:47] <elukey>	 I tried to find it but didn't
[14:08:01] <halfak>	  /srv/log/ores/app.log
[14:08:06] <elukey>	 ahhh
[14:08:10] <elukey>	 I always forget
[14:08:27] <elukey>	 Cc: jbond42 since he was working on it too
[14:08:31] <elukey>	 (as FYI)
[14:09:42] <jbond42__>	 thanks
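halfak's point that a Redis "timeout" is recoverable can be illustrated with a small retry wrapper. This is a minimal sketch under assumptions, not how Celery or ORES handles it: `call_with_retry` and its parameters are hypothetical names, and the built-in `TimeoutError` stands in for whatever exception the real Redis client raises.

```python
import time


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Call operation(), retrying on TimeoutError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError as err:
            # A timeout is treated as transient: remember it and try again.
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)
    # Every attempt timed out, so surface the last error to the caller.
    raise last_error
```

Errors other than a timeout propagate immediately, which keeps non-recoverable failures visible instead of hiding them behind retries.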
[14:10:39] <marostegui>	 !log Upgrade mysql on db2075
[14:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:17:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:19:32] <godog>	 ok, php7 performance for 404s during deploys looks terrible
[14:20:13] <godog>	 I'm leaning towards further restricting the alerts to status <400, not sure if there's anything actionable right now otherwise
[14:20:29] <Krinkle>	 Why would 4xx raise post-deploy?
[14:20:34] <godog>	 talking about https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1566386427498&to=1566397227498&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=404&panelId=9&fullscreen
[14:20:38] <godog>	 that I have no idea
[14:20:46] <wikibugs>	 (03PS2) 10Elukey: Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709)
[14:21:04] <godog>	 Krinkle: not the count of 4xx but their latency to be exact
[14:22:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:22:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) (owner: 10Elukey)
[14:26:36] <zeljkof>	 Krinkle: which browser did you say to use for mediawiki-new-errors? Chrome?
[14:26:51] <zeljkof>	 for editing that dashboard
[14:27:36] <moritzm>	 !log installing ca-certificates-java update from Stretch point release
[14:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:28:50] <elukey>	 !log swap turnilo backend in varnish from analytics-tool1002 to an-tool1007
[14:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:58] <elukey>	 new version of turnilo :)
[14:29:12] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor)
[14:29:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor)
[14:29:43] <godog>	 !log silence average mw appserver latency alerts for 24h, too noisy
[14:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:28] <Krinkle>	 zeljkof: yeah, when I said that it only worked properly in Chrome
[14:30:48] <Krinkle>	 I think Firefox quantum (65 and later) can handle it as well
[14:31:11] <zeljkof>	 Krinkle: thanks, I was using firefox 68 but it was really slow, chrome works better
[14:31:36] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) But on ores1006, the top-level error is:  ` redis.exceptions.TimeoutError: Timeout reading f...
[14:33:11] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797)
[14:33:29] <wikibugs>	 (03PS1) 10Ayounsi: Depool codfw for routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422)
[14:34:44] <zeljkof>	 Krinkle: I've just made the first edit to mediawiki-new-errors, from https://logstash.wikimedia.org/goto/62dcb0a9efdf79f2d14d5bd806962174 to https://logstash.wikimedia.org/goto/e480f1755ad9183ea62c20e6ef7cd424, hopefully I didn't break anything
[14:35:35] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "Rebase, please:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor)
[14:36:23] <moritzm>	 !log installing dns-root-data update from Stretch point release
[14:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:14] <wikibugs>	 10Operations, 10MediaWiki-General: Elevated php7 latency during mw deploy - https://phabricator.wikimedia.org/T230934 (10fgiunchedi)
[14:39:22] <godog>	 filed ^ btw, didn't seem quite normal
[14:42:14] <moritzm>	 !log installing java-common update from Stretch point release
[14:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:35] <Daimona>	 Urbanecm: hi, around?
[14:45:42] <Urbanecm>	 Daimona: yes
[14:46:11] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/superset/deploy@868635a]: Upgrading superset to 0.34rc1
[14:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:15] <Daimona>	 Good! Would it be possible for you to grep all existing on-wiki scripts?
[14:46:44] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@868635a]: Upgrading superset to 0.34rc1 (duration: 00m 33s)
[14:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:25] <Urbanecm>	 Daimona: certainly
[14:47:38] <Daimona>	 Cool
[14:47:51] <Daimona>	 I'd like to know if any script contains: abuseFilterBoxName
[14:48:25] <Urbanecm>	 Daimona: I guess no need to paste that as an NDA-only paste, right?
[14:48:35] <Daimona>	 Indeed :)
[14:48:51] <Daimona>	 It's a JS global that was removed
[14:48:59] <Daimona>	 I don't think anything is using that, but better safe
[14:49:05] <Urbanecm>	 thanks
[14:49:21] <Daimona>	 Thank you :)
[14:49:44] <Urbanecm>	 https://www.irccloud.com/pastebin/x3C2hpyC/
[14:49:53] <Urbanecm>	 seems you're right Daimona :))
[14:50:06] <Daimona>	 Eheh thanks :D
[14:50:44] <Urbanecm>	 yw
[14:52:56] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[14:53:37] <wikibugs>	 (03PS2) 10Jbond: wqds: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099)
[14:54:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wqds: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[14:59:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:00:24] <wikibugs>	 (03PS2) 10Jbond: MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099)
[15:00:46] <jbond42>	 !log adding interface::add_ip6_mapped to MediaWiki servers
[15:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:05:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::php: remove legacy php restart script [puppet] - 10https://gerrit.wikimedia.org/r/531508
[15:05:53] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509
[15:05:55] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510
[15:06:23] <wikibugs>	 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff)
[15:07:18] <moritzm>	 !log installing python-cryptography update from Stretch point release
[15:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 (owner: 10Giuseppe Lavagetto)
[15:08:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: remove legacy php restart script [puppet] - 10https://gerrit.wikimedia.org/r/531508 (owner: 10Giuseppe Lavagetto)
[15:08:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto)
[15:09:26] <wikibugs>	 (03PS1) 10Ayounsi: Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422)
[15:09:58] <wikibugs>	 (03PS2) 10Jbond: grafana: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531230 (https://phabricator.wikimedia.org/T102099)
[15:11:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] grafana: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531230 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:11:49] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[15:12:58] <wikibugs>	 (03Abandoned) 10Mholloway: Machine vision (beta): Configure Wikidata Beta item URL template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530575 (owner: 10Mholloway)
[15:13:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli)
[15:14:56] <wikibugs>	 (03PS2) 10Jbond: debug_proxy: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531231 (https://phabricator.wikimedia.org/T102099)
[15:15:02] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:15:38] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Rollback to 0.32
[15:15:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] debug_proxy: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531231 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:03] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Rollback to 0.32 (duration: 00m 25s)
[15:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:20] <wikibugs>	 (03PS6) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857)
[15:18:12] <wikibugs>	 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff)
[15:18:50] <wikibugs>	 (03PS2) 10CRusnov: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072)
[15:20:16] <wikibugs>	 (03CR) 10CRusnov: "Fixed!" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:21:38] <wikibugs>	 (03PS2) 10Jbond: maps - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531245 (https://phabricator.wikimedia.org/T102099)
[15:21:50] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "Expected output https://puppet-compiler.wmflabs.org/compiler1002/17977/" [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli)
[15:22:53] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "Expected https://puppet-compiler.wmflabs.org/compiler1002/17976/" [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli)
[15:27:09] <wikibugs>	 (03PS11) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114)
[15:28:50] <Amir1>	 Anyone to review this ^ 
[15:31:02] <wikibugs>	 (03CR) 10Smalyshev: "https://commons.wikimedia.org/entity/statement/M40538870-D3B1F1D8-C2E4-4C7B-B562-721D6C94CF25 should be a good one." [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev)
[15:35:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] maps - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531245 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:36:56] <wikibugs>	 (03PS2) 10Ayounsi: Depool codfw and eqsin for codfw routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422)
[15:38:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:38:47] <wikibugs>	 (03PS2) 10Jbond: maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099)
[15:39:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli)
[15:41:24] <wikibugs>	 (03PS2) 10Jbond: graphite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531236 (https://phabricator.wikimedia.org/T102099)
[15:42:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509
[15:42:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510
[15:43:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] graphite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531236 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:43:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto)
[15:44:37] <wikibugs>	 (03PS1) 10CRusnov: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521
[15:45:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli)
[15:46:15] <wikibugs>	 (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[15:47:36] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "@nuria: The patch has been merged on June 26 :)" [puppet] - 10https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey)
[15:48:04] <wikibugs>	 (03PS2) 10Jbond: webserver_misc_apps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531239 (https://phabricator.wikimedia.org/T102099)
[15:49:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] webserver_misc_apps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531239 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:53:00] <wikibugs>	 (03PS2) 10Jbond: openldap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531242 (https://phabricator.wikimedia.org/T102099)
[15:53:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] openldap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531242 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:56:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:56:30] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:58:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531258 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[15:58:35] <wikibugs>	 (03PS2) 10Jbond: swap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531258 (https://phabricator.wikimedia.org/T102099)
[16:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:44] <wikibugs>	 (03Merged) 10jenkins-bot: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[16:01:11] <moritzm>	 !log fixed apt config on krypton, broken getenvoy-jessie.list made apt-get update fail
[16:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:53] <wikibugs>	 (03CR) 10jenkins-bot: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[16:16:58] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:17:40] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:18:33] <urandom>	 moritzm: what does r531275 do?
[16:18:48] <urandom>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/531275
[16:19:31] <urandom>	 or jbond42 ^^^
[16:20:22] <jbond42>	 urandom: currently these servers have a SLAAC ipv6 address, which looks like ipv6 = $prefix:$mac_address.  this change updates the servers so that they will have ipv6 = $prefix:$ipv4
[16:20:41] <jbond42>	 inbound connections should be unaffected as the address is not in dns
[16:20:46] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:21:07] <urandom>	 jbond42: I see
[16:21:13] <jbond42>	 outbound connections would get a different ipv6 source address; however, this should not cause issues as the SLAAC address is not configured anywhere (well, nowhere in puppet)
[16:21:38] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:21:39] <jbond42>	 i have already deployed to a number of services and i'm pretty confident it should cause no issues
[16:21:40] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[16:21:50] <jbond42>	 thanks :)
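The two address schemes jbond42 describes can be sketched roughly as follows. This is only an illustration, not the puppet implementation: the function names, the example prefix `2001:db8::/64`, the MAC and the IPv4 address are all made up, and the exact way `$ipv4` is embedded (decimal octets reused as the last four hex groups) is an assumption. The SLAAC part follows the standard modified EUI-64 construction.

```python
import ipaddress


def slaac_eui64(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """$prefix:$mac_address, using a modified EUI-64 interface ID."""
    net = ipaddress.IPv6Network(prefix)
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC address.
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return net.network_address + int.from_bytes(iid, "big")


def mapped_from_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """$prefix:$ipv4 (assumed scheme: each decimal octet becomes one hex group)."""
    net = ipaddress.IPv6Network(prefix)
    ipaddress.IPv4Address(ipv4)  # raises ValueError on a malformed address
    iid = 0
    for octet in ipv4.split("."):
        # Read the decimal digits as hex so "64" shows up as group "64".
        iid = (iid << 16) | int(octet, 16)
    return net.network_address + iid


print(slaac_eui64("2001:db8::/64", "00:25:90:ab:cd:ef"))
print(mapped_from_ipv4("2001:db8::/64", "10.64.0.50"))
```

Under that assumed scheme the mapped address can be read straight off the host's IPv4 address, whereas the SLAAC one changes whenever the NIC (and so the MAC) is replaced.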
[16:26:40] <bblack>	 apparently the 5xx spike was brief, but it seems to have affected eqsin, ulsfo, and codfw, but not eqiad or esams.
[16:26:56] <bblack>	 (eqsin and ulsfo flow through codfw, forming one side of the world so to speak)
[16:28:35] <wikibugs>	 (03PS1) 10Elukey: profile::superset::proxy: add X-Forwarded-Proto "http" [puppet] - 10https://gerrit.wikimedia.org/r/531526 (https://phabricator.wikimedia.org/T230416)
[16:30:03] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (10Jclark-ctr) Received replacement SSD 1.9t {F30052557}
[16:30:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::superset::proxy: add X-Forwarded-Proto "http" [puppet] - 10https://gerrit.wikimedia.org/r/531526 (https://phabricator.wikimedia.org/T230416) (owner: 10Elukey)
[16:36:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Depool codfw and eqsin for codfw routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422) (owner: 10Ayounsi)
[16:37:04] <XioNoX>	 !log depool eqsin and codfw - T226422
[16:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:11] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[16:38:27] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov)
[16:41:40] <leszek_wmde>	 howdy, anyone up for deploying a small backport?
[16:42:05] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov)
[16:43:18] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (10Bstorm) Did Dell only send replacement SSD?  This has lost 4 disks in a very short time (all are failed now and most missing in the list of disks).  I highly suspect there is...
[16:44:24] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 55.44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:45:58] <XioNoX>	 that's expected ^
[16:46:02] <wikibugs>	 (03Merged) 10jenkins-bot: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov)
[16:46:26] <XioNoX>	 !log apply BGP graceful shutdown to cr1-codfw transits - T226422
[16:46:28] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.99 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:32] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[16:47:02] <wikibugs>	 (03CR) 10jenkins-bot: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov)
[16:48:04] <leszek_wmde>	 Amir1: so, you're deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/531528 ? Asking as there is another volunteer, so we want to avoid stepping on each other's toes
[16:51:12] <XioNoX>	 !log increase OSPF cost on ulsfo-codfw link - T226422
[16:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:13] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422) (owner: 10Ayounsi)
[16:55:21] <wikibugs>	 (03PS2) 10Ayounsi: Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422)
[16:56:22] <XioNoX>	 !log  Varnish: redirect eqsin/ulsfo text to eqiad - T226422
[16:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:28] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[16:56:46] <tarrow>	 Amir1: I will do it assuming that you're busy
[16:57:01] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) @Bstorm can you try rebooting the server and see if the disks get back to the correct order.   I know that works for analytics.  Please tr...
[16:57:44] <tarrow>	 Any objections to me SWATing in a patch? I guess I might overrun the window a bit, but there is still quite a while until the next calendar entry
[17:00:09] <tarrow>	 !log continuing the SWAT window to backport train blocker fixes
[17:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:22] <cmjohnson1>	 !log rebooting cloudvirt1024
[17:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:37] <wikibugs>	 (03PS2) 10Jbond: sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099)
[17:08:47] <XioNoX>	 !log disable BGP from cr1-codfw to lvs2001/2/3 - T226422
[17:08:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:53] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:15:07] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Yup, I can do that.  I'm not sure which either, per T230442#5429068 It dropped the failures from the list, and I'm not even entirely convince...
[17:15:25] <Amir1>	 tarrow: Thanks. UBNs can go in at any time
[17:15:34] <Amir1>	 by definition of UBN
[17:15:51] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) The disk was replaced but from what I can tell is that the raid configuration is not accepting the new disk. When I am in the raid utility...
[17:16:45] <tarrow>	 cool
[17:17:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[17:17:04] <tarrow>	 just waiting for our good friend jenkins
[17:17:11] <XioNoX>	 !log failover master RE to RE1 on cr1-codfw - T226422
[17:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:16] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:17:44] <bstorm_>	 !log reboot cloudvirt1024 to try and reset raid T230289
[17:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:50] <stashbot>	 T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289
[17:20:27] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) It wasn't showing the right number of disks when I was running things.  It was missing four, I believe?  Two have failed and logged tickets,...
[17:21:26] <XioNoX>	 linecards restarted as expected and coming online
[17:22:24] <wikibugs>	 (03PS2) 10Jbond: thumbor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099)
[17:23:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] thumbor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[17:24:19] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk).  It's not healthy. :)  Feel free to muck...
[17:24:38] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:25:03] <jbond42>	 XioNoX: is this you ^^
[17:25:16] <XioNoX>	 yes
[17:25:19] <XioNoX>	 eqsin is depooled
[17:25:26] <jbond42>	 ack ok thx
[17:25:26] <XioNoX>	 probably due to the transport link flap
[17:25:50] <XioNoX>	 !log shutdown RE0 on cr1-codfw - T226422
[17:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:56] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:28:17] <wikibugs>	 (03PS2) 10Jbond: etherpad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531225 (https://phabricator.wikimedia.org/T102099)
[17:29:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] etherpad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531225 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[17:29:18] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:33:28] <cmjohnson1>	 !log cloudvirt1015 down for a new motherboard 
[17:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:33] <XioNoX>	 !log failover master RE to RE0 on cr1-codfw - T226422
[17:33:40] <wikibugs>	 (03Abandoned) 10Jbond: spare::system: add ipv6 mapped addres [puppet] - 10https://gerrit.wikimedia.org/r/531157 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond)
[17:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:42] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:34:45] <XioNoX>	 Not ready for mastership switch, try after 107 secs.
[17:34:46] <XioNoX>	 haha
[17:35:54] <wikibugs>	 (03PS2) 10Jbond: logging::mediawiki::udp2log: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531257 (https://phabricator.wikimedia.org/T102099)
[17:36:44] <icinga-wm>	 PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:10] <XioNoX>	 bah, RE0 didn't pick up the config from RE1
[17:42:05] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) (owner: 10Cmjohnson)
[17:42:12] <XioNoX>	 solved
[17:42:17] <wikibugs>	 (03PS2) 10Cmjohnson: Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921)
[17:42:27] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Nuria) 05Open→03Resolved
[17:42:29] <XioNoX>	     warning: Chassis configuration for network services has been changed. A system reboot is mandatory.  Please reboot *ALL* routing engines NOW. Continuing without a reboot might result in unexpected system behavior.
[17:42:30] <wikibugs>	 Operations, Analytics, Security, Services (watching), Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Nuria)
[17:42:36] <wikibugs>	 (CR) Cmjohnson: [V: +2 C: +2] Removing mgmt dns for californium [dns] - https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) (owner: Cmjohnson)
[17:42:38] <XioNoX>	 that was not in the docs
[17:42:47] <wikibugs>	 (CR) Cmjohnson: [C: +2] Removing mgmt dns entries for logstash1000[4-6] [dns] - https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) (owner: Cmjohnson)
[17:42:51] <wikibugs>	 (PS2) Cmjohnson: Removing mgmt dns entries for logstash1000[4-6] [dns] - https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556)
[17:42:55] <wikibugs>	 (CR) Cmjohnson: [V: +2 C: +2] Removing mgmt dns entries for logstash1000[4-6] [dns] - https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) (owner: Cmjohnson)
[17:43:02] <XioNoX>	 !log restart both REs on cr1-codfw - T226422
[17:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:07] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:43:31] <wikibugs>	 (CR) Jbond: [C: +2] logging::mediawiki::udp2log: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531257 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[17:45:20] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 71.11 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:45:45] <XioNoX>	 wow, so quick to reboot
[17:47:22] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:48:02] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:48:04] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:49:52] <wikibugs>	 (CR) DannyS712: [C: +1] "Thanks for the follow up" [mediawiki-config] - https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: Urbanecm)
[17:50:14] <XioNoX>	 linecards take much more time to come back up
[17:50:23] <XioNoX>	 but they are up now
[17:50:36] <XioNoX>	 everything looks good, starting to repool everything
[17:50:50] <XioNoX>	 I mean make cr1 primary for everything
[17:51:10] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:51:39] <wikibugs>	 Operations, DBA, StructuredDiscussions, Growth-Team (Current Sprint), WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (JTannerWMF) a:Tgr
[17:52:46] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:53:23] <XioNoX>	 !log rollback: disable BGP from cr1-codfw to lvs2001/2/3 - T226422
[17:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:29] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[17:55:03] <XioNoX>	 !log Rollback: increase OSPF cost on ulsfo-codfw link - T226422
[17:55:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:04] <XioNoX>	 !log rollback: apply BGP graceful shutdown to cr1-codfw transits - T226422
[17:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:34] <wikibugs>	 Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (Bstorm) copied from T230442#5413070 `                     Versions                 ================ Product Name    : PERC H730P Adapter Serial No...
[17:57:04] <James_F>	 tarrow: I've got a train blocker to deploy once you're done, BTW.
[17:57:32] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:58:37] <wikibugs>	 Operations, ops-eqiad, decommission, User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (Jclark-ctr)
[17:59:21] <tarrow>	 James_F: awesome, doing this blocker first
[17:59:27] <tarrow>	 just a moment :)
[17:59:35] <wikibugs>	 Operations, ops-eqiad, decommission, media-storage, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (Jclark-ctr)
[17:59:43] <James_F>	 Of course, no rush. :-)
[18:00:11] <tarrow>	 We just waited 1hr on jenkins... :P
[18:00:19] <wikibugs>	 Operations, ops-eqiad, decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (Jclark-ctr)
[18:00:34] <James_F>	 Yeah, it's almost like Wikidata has too many complex tests. ;-P
[18:00:38] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:02:46] <tarrow>	 James_F: I see some un-checked-in changes in .19, am I still good to do a backport?
[18:03:19] <tarrow>	 specifically to includes/resourceloader/ResourceLoaderWikiModule.php
[18:03:54] <tarrow>	 not security commits or something but I guess an actual manually tweaked file
[18:03:59] * James_F looks.
[18:04:02] <Krinkle>	 there have been chmod issues
[18:04:23] <Krinkle>	 which can make a file show as modified when it isn't, due to git's internal state not being reflected on disk
[18:04:32] <XioNoX>	 !log increase OSPF cost on cr2-codfw links - T226422
[18:04:34] <tarrow>	 ah! right
[18:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:37] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[18:04:58] <tarrow>	 Krinkle: so should I be waiting until that's fixed? Or should I go ahead anyway?
[18:05:01] <Krinkle>	 tarrow: fixed. Yes, there are still chmod issues it seems.
[18:05:19] <tarrow>	 ah, cool :)
[18:08:07] <tarrow>	 James_F: at the risk of asking a stupid question it looks like somehow your backport is ahead of mine...
[18:08:30] <tarrow>	 can I get you to look before I break something?
[18:08:45] * James_F looks.
[18:09:05] <James_F>	 Yours is 1cc697615c3bd13bc3a46fffdb1da3e94cccc9cc, mine is "behind" that.
[18:09:46] <James_F>	 If you scap "just" the ext/Wikibase dir it'll be fine from my POV.
[18:09:54] <wikibugs>	 Operations, Elasticsearch, Traffic, Discovery-Search (Current work), Patch-For-Review: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (debt) Open→Resolved
[18:09:56] <tarrow>	 ok :)
[18:10:56] <leszek_wmde>	 checking
[18:11:37] <wikibugs>	 Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (Bstorm) https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y Looks like there's a number of fixes on this update of...
[18:12:15] <leszek_wmde>	 tarrow: looks good
[18:12:20] <tarrow>	 great!
[18:12:25] <XioNoX>	 !log deactivate transit links on cr2-codfw - T226422
[18:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:30] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[18:14:33] <tarrow>	 syncing now...
[18:14:45] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[18:14:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:52] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw
[18:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:20] <logmsgbot>	 !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/Wikibase/client/: SWAT: [[gerrit:531528|Use the backwards-compatible HTML ID for the wikidata item link (T66315)]] (duration: 00m 58s)
[18:15:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:26] <stashbot>	 T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315
[18:15:47] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[18:15:50] <leszek_wmde>	 tarrow: also looking good on non-debug host
[18:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:53] <tarrow>	 James_F: we're done :) Thanks for the help
[18:15:53] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw
[18:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:00] <leszek_wmde>	 tarrow: muchos gracias
[18:16:18] <James_F>	 Always. On behalf of RelEng, thanks for fixing. :-)
[18:16:57] <tarrow>	 We have one more UBN to fix but I think we will do it tomorrow morning with fresh eyes so we don't make things worse
[18:17:27] <wikibugs>	 Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Cmjohnson) Board arrived DOA...need another one
[18:17:53] <XioNoX>	 !log move VRRP master from cr2-codfw to cr1-codfw - T226422
[18:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:11] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[18:18:55] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.19/includes/specialpage/RedirectSpecialPage.php: T230932 RedirectSpecialArticle: Fix PHP notice about undefined index (duration: 00m 54s)
[18:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:00] <stashbot>	 T230932: RedirectSpecialArticle.php: PHP Notice: Undefined index: action - https://phabricator.wikimedia.org/T230932
[18:19:26] <XioNoX>	 !log shutdown re1:cr2-codfw (backup) - T226422
[18:19:28] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:39] <James_F>	 tarrow: Is that part of T230937 or is that a different task?
[18:21:39] <stashbot>	 T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937
[18:30:44] <XioNoX>	 waiting for RE1 to be 100% online
[18:31:04] <Urbanecm>	 jouncebot: next
[18:31:05] <jouncebot>	 In 1 hour(s) and 28 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2000)
[18:32:23] <XioNoX>	 !log failover master RE to RE1 on cr2-codfw - T226422
[18:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:29] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[18:32:46] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:33:24] <icinga-wm>	 RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms
[18:36:30] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:37:03] <XioNoX>	 !log shutdown re0:cr2-codfw (backup) - T226422
[18:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:56] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:44:33] <tarrow>	 James_F: sorry just got back to the hotel. Yes it is a solution for T230937. Patch is up but we didn't backport it yet. I think I'd be happiest to have slept on it/merge it tomorrow
[18:44:34] <stashbot>	 T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937
[18:48:26] <wikibugs>	 Operations: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (sbassett)
[18:49:08] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (Reedy)
[18:49:23] <wikibugs>	 Operations, netops: PCI Gap Assessment auditor question about SNMP - https://phabricator.wikimedia.org/T230952 (Jgreen)
[18:52:20] <wikibugs>	 Operations, netops: PCI Gap Assessment auditor question about SNMP - https://phabricator.wikimedia.org/T230952 (ayounsi) Open→Resolved a:ayounsi SNMPv2c Read Only.  Easy task! :)
[19:14:13] <XioNoX>	 !log failover master RE to RE0 on cr2-codfw - T226422
[19:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:20] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[19:16:25] <XioNoX>	 !log restart both REs on cr2-codfw - T226422
[19:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:54] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:19:14] <XioNoX>	 waiting for linecards to boot up
[19:19:20] <XioNoX>	 but so far everything looks healthy
[19:22:14] <icinga-wm>	 PROBLEM - PyBal BGP sessions are established on lvs2005 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw+prometheus/ops
[19:22:34] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:22:38] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:22:48] <XioNoX>	 alright all good
[19:23:13] <XioNoX>	 things are re-establishing as expected
[19:23:19] <XioNoX>	 devices look healthy
[19:23:46] <icinga-wm>	 RECOVERY - PyBal BGP sessions are established on lvs2005 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw+prometheus/ops
[19:24:53] <XioNoX>	 !log rollback: move VRRP master from cr2-codfw to cr1-codfw - T226422
[19:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:59] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[19:25:40] <XioNoX>	 !log rollback deactivate transit links on cr2-codfw - T226422
[19:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:14] <XioNoX>	 !log rollback: increase OSPF cost on cr2-codfw links - T226422
[19:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:16] <wikibugs>	 (PS1) Ayounsi: Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] - https://gerrit.wikimedia.org/r/531535
[19:29:29] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw
[19:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:40] <logmsgbot>	 !log ayounsi@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[19:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:17] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] - https://gerrit.wikimedia.org/r/531535 (owner: Ayounsi)
[19:30:30] <wikibugs>	 (PS2) Ayounsi: Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] - https://gerrit.wikimedia.org/r/531535
[19:30:48] <wikibugs>	 (PS1) Ayounsi: Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536
[19:31:41] <XioNoX>	 !log  Rollback: Varnish: redirect eqsin/ulsfo text to eqiad - T226422
[19:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:47] <stashbot>	 T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[19:32:22] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536 (owner: Ayounsi)
[19:32:27] <wikibugs>	 (PS2) Ayounsi: Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536
[19:34:38] <XioNoX>	 !log repool codfw and eqsin - T226422
[19:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:05] <wikibugs>	 Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi) Open→Resolved DONE! Everything is healthy, very little alert noise, no service impact.
[19:40:02] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:51] <XioNoX>	 chaomodus: ^ https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/run/ returns a server error
[19:52:12] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Track timeouts, log indices used, increase thread counts
[19:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:28] <XioNoX>	 we replaced the routing engines in codfw and updated them in netbox as well, so I guess it's related
[19:54:09] <wikibugs>	 Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi)
[19:54:45] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Track timeouts, log indices used, increase thread counts (duration: 02m 34s)
[19:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:39] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@556c4d0]: bulk_daemon: Track timeouts, log indices used, increase thread counts
[19:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and accraze: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2000).
[20:00:26] <XioNoX>	 !log test l3 ECMP in ulsfo
[20:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:21] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@556c4d0]: bulk_daemon: Track timeouts, log indices used, increase thread counts (duration: 04m 42s)
[20:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:01] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@67103e9]: bulk_daemon: Correct super() call
[20:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:01] <wikibugs>	 Operations, Traffic: Configure Layer3 hashing for router ECMP (for anycast DNS) - https://phabricator.wikimedia.org/T230955 (BBlack)
[20:17:20] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@67103e9]: bulk_daemon: Correct super() call (duration: 04m 19s)
[20:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:26] <wikibugs>	 Operations, Core Platform Team, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (kchapman)
[20:26:54] <wikibugs>	 Operations, Core Platform Team, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (kchapman) @CCicalese_WMF could you review this from a product perspective and determine if it is something we want to do?
[20:38:56] <wikibugs>	 (CR) Viztor: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor)
[20:39:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor)
[20:42:02] <wikibugs>	 (PS6) Viztor: Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175
[20:42:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor)
[20:45:14] <wikibugs>	 Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Slaporte) >>! In T204056#5261399, @tramm wrote: > ... Wikimedia Eesti doesn't directly control any nameservers (however we control many DNS records of domains...
[20:48:06] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@fc270fd]: bulk_daemon: Retune popularity_score bulk sizing
[20:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:31] <wikibugs>	 (PS7) Viztor: Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175
[20:51:55] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@fc270fd]: bulk_daemon: Retune popularity_score bulk sizing (duration: 03m 49s)
[20:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor)
[21:10:50] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (JAufrecht) > title, subtitle and description  I think what you have now, "Wikimedia Tech...
[21:30:18] <wikibugs>	 Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Aklapper) General reminder about [naming things](https://www.mediawiki.org/wiki/Naming_th...
[21:47:18] <icinga-wm>	 PROBLEM - SSH on labstore1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:48:44] <icinga-wm>	 RECOVERY - SSH on labstore1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:00:10] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:01:40] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:38:34] <wikibugs>	 (CR) DannyS712: [C: +1] "Looks good to me, pending deployment of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530014/" [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[22:39:30] <wikibugs>	 (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: MarcoAurelio)
[22:40:02] <wikibugs>	 Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi) Scheduled for Thursday Sept 5th, 8am PST, 11am local time, 15:00 UTC. 3h
[22:52:22] <wikibugs>	 Operations, Traffic, netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi)
[22:52:25] <wikibugs>	 Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi)
[22:52:28] <wikibugs>	 Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.