[00:45:57] Operations, netops: csw2-esams's VCP link flapped - https://phabricator.wikimedia.org/T229755 (ayounsi) Open→Declined > I finished working on them but I was not able to match the digital trace to any software report like bug or PR. > When there is a core-dump alongside to an event that caused an...
[01:11:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=302 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[01:14:17] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:15:43] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 77327 bytes in 1.435 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:59:59] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Slaporte) @tramm I wanted to confirm that we got your email and we're looking into it. Chuck is out of office for the next few days, following his work at Wiki...
[02:00:24] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) a:05CRoslof→03Slaporte [02:24:02] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Handle non-integer status_code in json response [02:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:11] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Handle non-integer status_code in json response (duration: 04m 09s) [02:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:33] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 42910168 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:54:25] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 18848 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:52:45] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [03:52:55] (03PS5) 10Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) [03:57:11] PROBLEM - check_trafficserver_tls_config_status on cp4021 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:58:11] PROBLEM - check_trafficserver_tls_config_status on cp5001 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:58:34] getting rid of checks is always noisy /o\ [03:58:55] PROBLEM - 
check_trafficserver_tls_config_status on cp2022 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:01] PROBLEM - check_trafficserver_tls_config_status on cp2017 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:05] PROBLEM - check_trafficserver_tls_config_status on cp1090 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:17] PROBLEM - check_trafficserver_tls_config_status on cp4026 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:17] PROBLEM - check_trafficserver_tls_config_status on cp4022 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:19] PROBLEM - check_trafficserver_tls_config_status on cp2005 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:19] PROBLEM - check_trafficserver_tls_config_status on cp2011 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:19] PROBLEM - check_trafficserver_tls_config_status on cp2018 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:21] PROBLEM - check_trafficserver_tls_config_status on cp2024 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:31] PROBLEM - check_trafficserver_tls_config_status on cp1084 is CRITICAL: NRPE: Command 
check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:31] PROBLEM - check_trafficserver_tls_config_status on cp1082 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:37] PROBLEM - check_trafficserver_tls_config_status on cp1086 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:37] PROBLEM - check_trafficserver_tls_config_status on cp3044 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:37] PROBLEM - check_trafficserver_tls_config_status on cp3034 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:41] PROBLEM - check_trafficserver_tls_config_status on cp1076 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:41] PROBLEM - check_trafficserver_tls_config_status on cp4025 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:43] PROBLEM - check_trafficserver_tls_config_status on cp2002 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:43] PROBLEM - check_trafficserver_tls_config_status on cp2008 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:45] PROBLEM - check_trafficserver_tls_config_status on cp5002 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:45] PROBLEM - check_trafficserver_tls_config_status on cp5006 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:45] PROBLEM - check_trafficserver_tls_config_status on cp5004 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:45] PROBLEM - check_trafficserver_tls_config_status on cp5005 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:49] PROBLEM - check_trafficserver_tls_config_status on cp2025 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:51] PROBLEM - check_trafficserver_tls_config_status on cp2014 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:55] PROBLEM - check_trafficserver_tls_config_status on cp3045 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:59:57] PROBLEM - check_trafficserver_tls_config_status on cp4023 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:01] PROBLEM - check_trafficserver_tls_config_status on cp3039 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:01] PROBLEM - check_trafficserver_tls_config_status on cp3035 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:05] PROBLEM - 
check_trafficserver_tls_config_status on cp1078 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:07] PROBLEM - check_trafficserver_tls_config_status on cp2026 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:09] PROBLEM - check_trafficserver_tls_config_status on cp3047 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:11] PROBLEM - check_trafficserver_tls_config_status on cp1088 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:11] PROBLEM - check_trafficserver_tls_config_status on cp1080 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:17] PROBLEM - check_trafficserver_tls_config_status on cp4024 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:18] (03PS10) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [04:00:19] PROBLEM - check_trafficserver_tls_config_status on cp3036 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:00:19] PROBLEM - check_trafficserver_tls_config_status on cp3046 is CRITICAL: NRPE: Command check_check_trafficserver_tls_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:01:34] sorry about that.. 
that check has been removed [04:02:15] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [04:02:26] (03PS4) 10Vgutierrez: ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) [04:07:11] (03CR) 10jerkins-bot: [V: 04-1] backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [04:15:11] (03PS1) 10CRusnov: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) [04:23:08] (03PS1) 10Vgutierrez: Release 8.0.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/531332 [04:34:59] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [04:35:09] (03PS9) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) [04:41:17] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 490 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [04:52:27] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (10Marostegui) 05Open→03Resolved Thank you Chris! 
This looks good now ` root@db1063:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [04:53:36] (03PS1) 10Vgutierrez: prometheus: Consider the new layer label for ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) [04:58:03] (03PS1) 10Marostegui: db1122: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/531335 (https://phabricator.wikimedia.org/T230785) [04:58:44] (03CR) 10Marostegui: [C: 03+2] db1122: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/531335 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [05:01:03] (03PS1) 10Marostegui: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) [05:02:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [05:03:01] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [05:03:16] (03CR) 10jenkins-bot: db-eqiad.php: Clarify that db1122 is the candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531336 (https://phabricator.wikimedia.org/T230785) (owner: 10Marostegui) [05:04:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1122 status: candidate master for s2 - T230785 (duration: 00m 55s) [05:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:40] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 [05:05:03] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depool db1122 for binlog format change', diff saved to https://phabricator.wikimedia.org/P8949 and previous config saved to /var/cache/conftool/dbconfig/20190821-050501-marostegui.json [05:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:46] !log Restart MySQL on db1122 for binlog format change - T230785 [05:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1122 after restart', diff saved to https://phabricator.wikimedia.org/P8950 and previous config saved to /var/cache/conftool/dbconfig/20190821-051441-marostegui.json [05:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1122', diff saved to https://phabricator.wikimedia.org/P8951 and previous config saved to /var/cache/conftool/dbconfig/20190821-052613-marostegui.json [05:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:34:17] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1122', diff saved to https://phabricator.wikimedia.org/P8952 and previous config saved to /var/cache/conftool/dbconfig/20190821-054542-marostegui.json [05:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:33] (03CR) 10ArielGlenn: [C: 03+1] dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) (owner: 
Jbond)
[05:48:31] (CR) ArielGlenn: [C: +1] dumps::generation::server: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:49:05] (CR) ArielGlenn: [C: +1] dumps: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:12:13] Operations, Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (Joe) p:Triage→Normal
[06:24:38] (PS2) Muehlenhoff: Setup partman config for puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531190
[06:28:15] (CR) Muehlenhoff: [C: +2] Setup partman config for puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531190 (owner: Muehlenhoff)
[06:29:27] I just checked the OSPF alarms; they are related to the Zayo circuit between cr2-codfw and cr2-eqiad. There is maintenance scheduled
[06:29:43] so everything seems ok :)
[06:39:59] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:43:47] RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-21 03:47:27 from db2100.codfw.wmnet:3317 (850 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:44:08] (CR) Muehlenhoff: backup::ofsite: add ipv6 mapped address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[06:56:12] (CR) Muehlenhoff: "Let's split this and first take the mwdebug servers and canaries."
[puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [07:00:55] (03CR) 10Muehlenhoff: "I don't see a patch for redis/eqiad lined up or maybe just not added with reviewers yet?" [puppet] - 10https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [07:09:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [07:11:58] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [07:24:51] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10Joe) [07:30:46] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) We maintain custom 7.2 packages anyway (based on the 7.2.x releases), we can cherrypick the patch for our package upd... 
[07:41:03] (03CR) 10Vgutierrez: [C: 03+1] ATS: enable compress.so for upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:44:24] (03PS2) 10Ema: ATS: enable compress.so for upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) [07:45:08] (03CR) 10Ema: [C: 03+2] ATS: enable compress.so for upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/530823 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:47:41] (03CR) 10Ema: [C: 03+1] Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) (owner: 10Elukey) [07:53:49] 10Operations, 10DBA: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) [07:53:52] 10Operations, 10DBA: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (10Marostegui) [07:53:57] 10Operations, 10DBA: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Marostegui) [07:54:12] 10Operations, 10DBA: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) p:05Triage→03Normal [07:54:23] 10Operations, 10DBA: Decommission db2059.codfw.wmnet - https://phabricator.wikimedia.org/T230884 (10Marostegui) p:05Triage→03Normal [07:54:29] 10Operations, 10DBA: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Marostegui) p:05Triage→03Normal [07:55:16] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [07:56:45] !log upload@eqsin: rolling ats-backend-restart to enable compress plugin [07:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1122', diff saved to https://phabricator.wikimedia.org/P8953 and previous config saved to 
/var/cache/conftool/dbconfig/20190821-075813-marostegui.json [07:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:42] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) Binary log format changed on db1122, host upgraded and rebooted. [08:00:17] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) [08:02:14] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [08:02:43] (03PS1) 10Marostegui: mariadb: Decommission db2052 [puppet] - 10https://gerrit.wikimedia.org/r/531380 (https://phabricator.wikimedia.org/T230883) [08:02:57] (03CR) 10Marostegui: [C: 03+1] "Let's go for these two hosts after the test ones" [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:03:12] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [08:03:30] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2052 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531362 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [08:04:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2052 [puppet] - 10https://gerrit.wikimedia.org/r/531380 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [08:04:37] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2052 from config T230883 (duration: 00m 54s) [08:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:46] T230883: 
Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 [08:05:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2052 from config T230883 (duration: 00m 54s) [08:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:29] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) [08:11:28] !log Remove db2052 from tendril and zarcillo T230883 [08:11:34] !log Stop MySQL on db2052 T230883 [08:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:36] T230883: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 [08:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) a:05Marostegui→03RobH [08:12:54] (03PS1) 10Hashar: Remove role::ci::slave::webperformance [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) [08:12:56] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) This host is ready for #dc-ops to decommission [08:13:09] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [08:16:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10MoritzMuehlenhoff) Did the technician replace the mainboard? 
[08:18:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:18:47] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:18:56] !log installing puppetdb2002 [08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:44] (03PS2) 10Elukey: Add base configuration for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025) [08:21:05] (03PS2) 10Ema: varnishlog: request/response headers to send to logstash [puppet] - 10https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) [08:21:28] (03PS1) 10Tarrow: Termbox Staging Test - Allow numeric characters in language codes [deployment-charts] - 10https://gerrit.wikimedia.org/r/531426 [08:21:30] (03PS1) 10Tarrow: Termbox Staging - Allow numeric characters in language codes [deployment-charts] - 10https://gerrit.wikimedia.org/r/531427 [08:21:32] (03PS1) 10Tarrow: Termbox codfw - Allow numeric characters in language codes [deployment-charts] - 10https://gerrit.wikimedia.org/r/531428 [08:21:34] (03PS1) 10Tarrow: Termbox eqiad - Allow numeric characters in language codes [deployment-charts] - 10https://gerrit.wikimedia.org/r/531429 [08:25:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This first needs a sudo rule that will allow mwdeploy to restart php-fpm" [puppet] - 10https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: 10Thcipriani) [08:25:46] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:25:54] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) 
(owner: 10Jbond) [08:26:13] (03CR) 10Hashar: "I cant tell what are the impact of adding ipv6 to thumbor sorry :\" [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:27:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530014/ needs to be merged before we proceed further." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [08:27:23] (03CR) 10Elukey: [C: 03+2] Add base configuration for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [08:28:31] new zookeeper nodes for analytics --^ [08:28:34] \o/ [08:29:04] hopefully we'll move analytics-related znodes away from conf* soon [08:29:07] !log upgrading PHP on contint* [08:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New library to interact with poolcounter from python [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517828 (owner: 10Giuseppe Lavagetto) [08:32:34] (03PS3) 10Ema: varnishlog: request/response headers to send to logstash [puppet] - 10https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) [08:34:27] (03CR) 10Ema: [V: 03+2 C: 03+2] varnishlog: request/response headers to send to logstash [puppet] - 10https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) (owner: 10Ema) [08:36:48] (03PS4) 10Gehel: wdqs: restrict port 8888 to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [08:37:39] (03PS1) 10Tarrow: Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) [08:38:50] (03PS2) 10Tarrow: Enable Termbox on wikidatawiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) [08:38:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 (owner: 10Giuseppe Lavagetto) [08:39:39] (03CR) 10Gehel: [C: 03+2] wdqs: restrict port 8888 to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/530856 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [08:40:35] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:41:03] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10fgiunchedi) p:05Unbreak!→03Normal Downgrading to normal since the situation has stabilized and returned to normal as 23:40 UTC, stil... [08:41:10] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow) [08:48:29] (03CR) 10Jakob: [C: 03+2] "This change is ready for review." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/531426 (owner: 10Tarrow) [08:48:43] (03PS1) 10Elukey: Add AAAA/A/PTR records for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) [08:51:30] (03PS2) 10Elukey: Add AAAA/A/PTR records for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) [08:51:33] anybody up for a quick DNS review? 
[08:51:36] --^
[08:51:42] (new hosts)
[08:51:48] looking
[08:52:26] thanksss
[08:53:59] (CR) Jakob: [V: +2 C: +2] Termbox Staging Test - Allow numeric characters in language codes [deployment-charts] - https://gerrit.wikimedia.org/r/531426 (owner: Tarrow)
[08:56:20] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) (owner: Elukey)
[08:56:29] Operations, Wikimedia-Logstash, Wikimedia-Incident: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (fgiunchedi) Looks like a logstash consumer has failed, according to kafka logs on logstash1010 ` [2019-08-20 23:09:36,042] INFO [GroupC...
[08:57:40] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' .
[08:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:42] (CR) Elukey: [C: +2] Add AAAA/A/PTR records for an-conf100[1-3] [dns] - https://gerrit.wikimedia.org/r/531435 (https://phabricator.wikimedia.org/T227025) (owner: Elukey)
[08:59:38] (CR) Gehel: [C: -1] Add maps reboot cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[09:00:04] tarrow and jakob_wmde: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Initial deployment of the new mobile termbox on Wikidata . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T0900).
[09:07:23] <_joe_> !log uploaded python-poolcounter to stretch,buster
[09:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:45] (CR) Jakob: [V: +2 C: +2] "This change is ready for review."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/531427 (owner: 10Tarrow) [09:09:29] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [09:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:43] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:09:49] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:09:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "lgtm, one small nit (optional)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [09:14:57] (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/531428 (owner: 10Tarrow) [09:15:25] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . [09:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:15] (03PS18) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [09:18:28] (03CR) 10jerkins-bot: [V: 04-1] Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [09:19:13] (03CR) 10Jakob: [V: 03+2 C: 03+2] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/531429 (owner: 10Tarrow) [09:20:39] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:49] (03CR) 10Marostegui: [C: 03+1] "These are passive, so let's deploy there and check" [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:24:33] (03CR) 10Jbond: [C: 03+2] dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:24:34] Right, we're enabling termbox on wikidatawiki now [09:24:38] (03PS2) 10Jbond: dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) [09:25:04] tarrow: what's the expected impact of that? [09:25:24] (03PS19) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [09:25:28] some actual load on the termbox service [09:25:45] a little (really a drop in the ocean) more load on the api appservers [09:26:01] (03CR) 10Jakob: [C: 03+2] Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow) [09:26:02] tarrow: gotcha, thanks :-) [09:26:17] (03CR) 10Jbond: [C: 03+2] dumps::generation::server: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:26:20] (03CR) 10Mathew.onipe: Add maps reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [09:26:24] (03PS2) 10Jbond: dumps::generation::server: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) [09:27:20] and there might be a small increase in the PCache size (again we expect this to be small) [09:27:26] (03Merged) 
10jenkins-bot: Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow) [09:28:22] (03CR) 10jenkins-bot: Enable Termbox on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531433 (https://phabricator.wikimedia.org/T230896) (owner: 10Tarrow) [09:28:33] tarrow: thank you :) [09:28:58] (03CR) 10Jbond: [C: 03+2] dumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:29:06] (03PS2) 10Jbond: dumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) [09:29:15] !log rebooting db2102 (reverting to a proper stretch 4.9 kernel, it used a bpo kernel due to some hardware debugging a while back) [09:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:05] moritzm: did you downtime it? [09:30:20] yeah and stopped mariadb.service [09:30:23] \o/ [09:30:46] for mariadb::core_test roles I also need to manually start mariadb.service when it's back, right? 
[09:30:52] yeah [09:30:56] ack [09:30:56] I can do that once it is back if you want [09:30:59] and start replication [09:31:07] moritzm: just let me know when it is back up [09:31:08] (03CR) 10Mobrovac: [C: 03+1] restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:31:14] ack, I'll ping you when it's booted to the correct kernel [09:31:18] thanks [09:32:09] (03CR) 10Jbond: [C: 03+2] debmonitor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:32:16] (03PS2) 10Jbond: debmonitor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) [09:34:25] marostegui: it's up and I've purged the old backports kernel, so further reboots won't need manual selection of the 4.9 kernel in Grub [09:34:37] great! I will take care of mysql [09:34:38] thanks [09:34:42] (03PS2) 10Jbond: backup::offsite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) [09:34:57] (03CR) 10Jbond: backup::offsite: add ipv6 mapped address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:36:25] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:531433|Enable Termbox on wikidatawiki (T230896)]] (duration: 00m 55s) [09:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:33] T230896: Enable Termbox on wikidatawiki - https://phabricator.wikimedia.org/T230896 [09:38:34] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:39:39] (03PS2) 10Jbond: puppetboard/puppetdb: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531266 
(https://phabricator.wikimedia.org/T102099) [09:40:55] (03Abandoned) 10Giuseppe Lavagetto: hiera: fix the hierarchical order of lookups [puppet] - 10https://gerrit.wikimedia.org/r/475500 (owner: 10Giuseppe Lavagetto) [09:41:01] (03CR) 10Jbond: [C: 03+2] puppetboard/puppetdb: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:46:59] !log finished enabling termbox on wikidatawiki [09:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:52] (03PS2) 10Jbond: mariadb::parsercache - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) [09:54:45] jouncebot: next [09:54:45] In 1 hour(s) and 5 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1100) [09:56:58] (03CR) 10Jbond: [C: 03+2] mariadb::parsercache - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:00:43] (03PS3) 10Jbond: mariadb::core_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) [10:01:41] (03CR) 10Jbond: [C: 03+2] mariadb::core_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:02:04] !log installing puppetdb1002 [10:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:45] (03CR) 10Marostegui: [C: 03+1] "These hosts are passive, so let's deploy there" [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:06:48] (03CR) 10Marostegui: [C: 03+1] mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 
(https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:07:17] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [10:08:49] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [10:09:28] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [10:10:30] yeah definitely we need some tweaks on the latency alerts, a bit too touchy now even if true [10:11:12] (03PS2) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) [10:11:14] (03PS1) 10Jbond: MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) [10:12:23] (03PS3) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) [10:12:31] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:12:35] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:12:57] godog: +1 [10:14:29] (03PS2) 10Jbond: mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099) [10:15:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:15:42] (03CR) 10Jbond: [C: 03+2] mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:23:22] (03PS1) 10Muehlenhoff: puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454 [10:25:08] (03PS3) 10Jbond: mariadb::proxy - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) [10:25:52] (03CR) 10Jbond: [C: 03+2] mariadb::proxy - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:27:16] (03PS1) 10Marostegui: wmnet: Update s8-master record [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) [10:27:27] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) (owner: 10Marostegui) [10:28:02] (03PS2) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) [10:28:12] (03PS4) 10Marostegui: mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 
(https://phabricator.wikimedia.org/T229657) [10:28:19] (03PS3) 10Marostegui: wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) [10:28:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:34:22] (03CR) 10Jbond: [C: 03+2] MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:34:34] (03PS4) 10Jbond: MW servers - eqiad (canary and debug): add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) [10:35:50] heads up im deploying the ipv6 mapped change to the mw canary and debug hosts. i dont expect any impact https://gerrit.wikimedia.org/r/c/operations/puppet/+/531256 [10:35:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [10:40:15] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:42:16] (03PS2) 10Jbond: restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) [10:42:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:45:00] (03CR) 10Jbond: [C: 03+2] restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:45:48] !log Run mwscript namespaceDupes.php --wiki=zhwikisource --add-prefix=FIXME --fix (T230548) [10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:57] T230548: Shortcut namespace redirect on zhwikisource - https://phabricator.wikimedia.org/T230548 [10:47:34] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:49:35] !log Wrapped code added to CommonSettings.php in T230601 to wgExtensionFunctions [10:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:43] T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601 [10:49:54] !log Previous log entry was for mwdebug1002 [10:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff) [10:52:43] !log Move 0a87e3c's code to abusefilter.php on mwdebug1002 (T230601) [10:52:48] (03CR) 10Gehel: [C: 03+1] elasticsearch::relforge: add ipv6 mapped address [puppet] - 
10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:29] (03PS2) 10Jbond: elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) [10:54:05] 10Operations, 10DBA, 10MediaWiki-Configuration, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084 (10Marostegui) @Joe can this be considered done already with `dbctl`? [10:55:16] 10Operations, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (10Joe) [10:55:19] 10Operations, 10DBA, 10MediaWiki-Configuration, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084 (10Joe) 05Open→03Resolved a:03Joe Indeed! we're doing more than this! [10:57:05] !log Run scap pull on mwdebug1002 (T230601) [10:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:13] T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601 [10:57:48] 10Operations, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (10Marostegui) [10:57:55] (03CR) 10Jbond: [C: 03+2] elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1100). 
[11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:05:49] (03CR) 10Volans: [C: 04-1] "A minor thing and a nit, see inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [11:06:11] hello there :) I've got a config change, any deployer up for it? [11:11:25] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/17967/" [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff) [11:11:37] (03PS2) 10Muehlenhoff: puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454 [11:12:08] (03CR) 10Marostegui: [C: 03+1] mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:13:11] alaa_wmde: let me do it [11:13:47] thanks @Amir1 [11:15:20] (03PS1) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025) [11:15:32] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:15:53] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb/postgres: Drop support for jessie, add support for buster [puppet] - 10https://gerrit.wikimedia.org/r/531454 (owner: 10Muehlenhoff) [11:16:01] Amir1: thanks! 
[11:16:09] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [11:16:23] Wow, so many reverts [11:16:44] (03PS2) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025) [11:17:01] Urbanecm: it's actually more, Alaa forked it [11:17:18] I have one small issue, I can't move between panes in tmux [11:17:19] (03PS3) 10Elukey: Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025) [11:17:39] my up arrow key is broken... [11:19:15] I need a couple of minutes to remap my keyboard [11:19:41] (03CR) 10Elukey: [C: 03+2] Add new partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531466 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:20:14] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:20:56] Amir1: I bought an external keyboard for that purpose [11:21:08] (i doesn't work on my normal keyboard) [11:22:22] > Wow, so many reverts [11:22:22] yeap it is the switch that flips the table upside-down on terms store... 
and clients weren't fully tested unfortunately leading to some bugs being discovered only in production [11:22:53] Got it :) [11:24:00] (03CR) 10Jbond: [C: 03+2] mariadb::misc::tendril: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:24:08] (03PS2) 10Jbond: mariadb::misc::tendril: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) [11:24:46] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:27:13] (03CR) 10Jbond: [C: 03+2] elasticsearch::cirrus - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:27:21] (03PS2) 10Jbond: elasticsearch::cirrus - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) [11:27:27] made some progress will be back soon [11:27:54] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:28:28] (03PS1) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:28:58] moritzm: --^ [11:29:36] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluste [11:29:36] ethod=GET [11:29:45] looking [11:31:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline, but feel free to ignore." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:32:18] (03CR) 10Elukey: Use standard partman recipe for an-conf100[1-3] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:33:26] (03PS2) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:33:55] (03PS3) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:34:24] PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:28] PROBLEM - Check systemd state on ores1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:30] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:50] ^ checking [11:34:55] celery-ores-worker.service [11:34:56] unless someone knows something [11:35:02] tx [11:35:15] nono please go [11:35:22] I am checking as well [11:35:58] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:58] (03PS2) 10Jbond: elasticsearch::cirrus - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) [11:36:21] I think it restarted itself [11:36:35] Urbanecm: can you do it? my up key doesn't work and I couldn't make it to work [11:36:37] (03CR) 10Jbond: [C: 03+2] elasticsearch::cirrus - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:36:50] not on 1002 (I am on it) [11:37:21] Amir1: well I'm on mobile, I'm sorry [11:37:38] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:49] !log restart celery-ores-worker on ores1002 [11:37:51] elukey: should we just start them ? [11:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:13] I think so yes, can't find why they died in the logs though [11:38:15] hmm, I can buy a keyboard fairly quickly but we need to extend the swat a bit [11:38:29] !log Restarting ores on ores1004 and ores1005 [11:38:30] I can probably deploy it, but i'm not sure i would be also able to revert it should it be needed [11:38:33] lol no no we can do it tomorrow too [11:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:39] elukey: jijiki i have started 1005 [11:38:41] no need to stress and rush into buying a keyboard now [11:38:53] alaa_wmde: are you sure? 
I need the keyboard anyway :D [11:39:00] In that case, /me votes for rescheduling :D [11:39:04] jbond42: oh I had no idea [11:39:08] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:20] maybe we all overreacted here :p [11:39:51] sorry i had only just come to the party and elu.key said to restart and i was there [11:40:08] Amir1: yeah we can do that tomorrow .. maybe I do my first deployment while you're watching over my shoulders ;) [11:40:48] (03CR) 10Muehlenhoff: [C: 03+1] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:41:33] well celery down out of the blue for ores is not really great :D [11:41:40] better to restart than leaving it broken [11:41:59] !log EU SWAT is done [11:42:03] alaa_wmde: sure! [11:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:16] alaa_wmde: you have deployment privs? 
[11:42:27] (03PS4) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:42:29] (03CR) 10Jbond: [C: 03+2] idp: remove ldap[0].providerClass [puppet] - 10https://gerrit.wikimedia.org/r/530409 (owner: 10Jbond) [11:42:34] (03PS2) 10Jbond: idp: remove ldap[0].providerClass [puppet] - 10https://gerrit.wikimedia.org/r/530409 [11:42:56] Urbanecm: I must have them by now yes [11:44:01] (03PS2) 10Jbond: mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) [11:44:10] (03PS5) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:44:29] alaa_wmde: so you can do it yourself next time, good to know :d [11:44:43] (03CR) 10Jbond: [C: 03+2] mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:44:50] so now we should figure out why celery went down [11:46:02] (03CR) 10Elukey: [C: 03+2] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:46:09] (03PS6) 10Elukey: Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) [11:46:11] (03CR) 10Elukey: [V: 03+2 C: 03+2] Use standard partman recipe for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531469 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [11:49:16] Urbanecm: yeap once I do my first deployment (or first few with someone by my side for quick questions) then you'll probably see me lurking here a lot more often :) [11:49:58] Good! 
[11:50:07] Feel free to ask questions if you have any [11:50:44] Amir1: o/ - do you have any idea why celery on some ores100X nodes could shut down? [11:52:04] elukey: i see this in the logs https://phabricator.wikimedia.org/P8955 on at least 1005 and 1002 [11:52:52] jbond42: I checked the code and it doesn't seem to abruptly cause a shutdown, and the rest seems more or less what celery emits when shutting down afaics.. [11:52:55] really strange [11:53:03] plus are we missing alarms for celery? [11:54:12] yes i dont really know anything about celery im afraid [11:54:22] me too :( [11:56:10] One edge case we've documented is that any connectivity glitch between Celery and ORES's Redis cluster will cause the celery service to go zombie. [11:56:31] awight: o/ [11:56:38] go zombie means shutting down ? [11:56:44] There's a watchdog which restarts every 15 minutes or something. [11:57:14] elukey: I thought I remember them continuing to run but never accepting further jobs? But shutting down is also a failure we saw, maybe caused by something else. [11:57:29] Would be good to review the memory usage graphs, IMO. 
[11:58:11] all right opening a task :) [11:58:33] (03PS2) 10Jbond: mw servers - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531255 (https://phabricator.wikimedia.org/T102099) [11:59:21] (03CR) 10Jbond: [C: 03+2] mw servers - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531255 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:59:48] !log add ipv6 mapped address to mw codfw servers [11:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1200) [12:01:07] (03PS2) 10Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - 10https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) [12:01:15] (03PS1) 10Effie Mouzeli: mediawiki::users: Allow adding privileges via profiles [puppet] - 10https://gerrit.wikimedia.org/r/531474 [12:01:18] (03PS1) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [12:02:48] Amir1: fwiw, ^-A will switch panes I believe [12:04:46] (03CR) 10Gergő Tisza: [C: 03+1] Log dnsblacklist entries at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: 10Urbanecm) [12:06:50] This doesn't look healthy, https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?refresh=1m&panelId=25&fullscreen&orgId=1&from=now-7d&to=now-1m [12:06:59] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) [12:07:06] there you go --^ [12:07:47] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] 
without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) [12:12:15] (03CR) 10Filippo Giunchedi: [C: 04-1] "You should append "_layer" to the resulting metric, to indicate layer is there too now. Also consider that this will create new metrics an" [puppet] - 10https://gerrit.wikimedia.org/r/531334 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [12:12:50] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) @Cmjohnson I was able to install the OS on an-conf1001 via manual PXE install, but I had to set in the BIOS the following serial console setting: `S... [12:22:45] elukey: Thank you! [12:23:27] (03CR) 10Filippo Giunchedi: [C: 04-1] "Prometheus hosts already have ipv6 afaics" [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:23:54] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:24:07] (03CR) 10Filippo Giunchedi: [C: 03+1] swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:24:11] (03CR) 10Filippo Giunchedi: [C: 03+1] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:28:18] (03CR) 10Muehlenhoff: "Already present in profile::lvs" [puppet] - 10https://gerrit.wikimedia.org/r/531244 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:29:00] (03CR) 10Muehlenhoff: "Already present in role::installserver" [puppet] - 10https://gerrit.wikimedia.org/r/531237 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:38:02] (03PS2) 10Effie Mouzeli: mediawiki::users: Allow adding 
privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474 [12:44:58] (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Something is very wrong" [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli) [12:45:11] (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Something is very wrong" [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli) [12:46:42] (03PS2) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [12:50:38] 10Operations, 10serviceops: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) [12:52:36] 10Operations, 10serviceops: Update component/php72 to 7.2.21 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) I'm running into a build failure, which I initially assumed was caused by DNS resolution in pbuilder/boron, but it's ultimately caused by MariaDB; the build calls mysql_install_db from... [13:00:04] zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1300). 
[13:00:46] (03PS3) 10Effie Mouzeli: mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474 [13:00:58] thank you jouncebot for the reminder [13:01:00] (03PS3) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [13:01:06] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:01:39] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2059 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531480 (https://phabricator.wikimedia.org/T230884) [13:02:40] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:05:20] (03PS4) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [13:07:06] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10tstarling) Cherry pick is not exactly the right word, I'm just proposing a temporary hack so that it will maybe work, whereas PHP 7.3 do... 
[13:08:03] (03PS1) 10Zfilipin: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 [13:08:05] (03CR) 10Zfilipin: [C: 03+2] group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin) [13:09:19] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin) [13:09:54] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531481 (owner: 10Zfilipin) [13:11:02] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.19 [13:11:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:11:37] ^ downtime expired [13:11:42] that is mine [13:11:58] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.19 (duration: 00m 55s) [13:12:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:16] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) 
is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:16:42] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,400} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method [13:17:21] (03PS2) 10Elukey: Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046 [13:19:04] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17972/" [puppet] - 10https://gerrit.wikimedia.org/r/531046 (owner: 10Elukey) [13:19:54] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:20:07] (03Abandoned) 10Jbond: prometheus: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:20:26] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:20:57] (03Abandoned) 10Jbond: lvs::balancer: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531244 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:21:59] (03Abandoned) 10Jbond: installserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531237 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:22:27] (03PS4) 10CDanis: deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291 [13:22:31] (03PS2) 10Jbond: logstash: add ipv6 mapped address [puppet] - 
10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) [13:23:33] (03CR) 10CDanis: [C: 03+2] deployment: add fix-staging-perms command & sudo for it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [13:24:04] (03PS1) 10Tchanders: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 [13:24:28] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,400} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method [13:25:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:26:36] (03CR) 10Jbond: [C: 03+2] logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:26:52] (03CR) 10Dmaza: [C: 03+1] Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders) [13:27:31] (03PS5) 10CDanis: deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291 [13:27:36] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:27:43] (03CR) 10CDanis: [V: 03+2 C: 03+2] deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [13:28:39] (03PS3) 10Elukey: Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046 [13:28:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046 (owner: 10Elukey) [13:29:57] (03PS2) 10Jbond: swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) [13:31:17] (03CR) 10Jbond: [C: 03+2] swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:33:18] (03CR) 10Catrope: [C: 03+2] Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders) [13:34:11] (03PS1) 10Nuria: Removing loading of Reading_Depth into druid [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) [13:34:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:35:06] (03Merged) 10jenkins-bot: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders) 
[13:37:12] (03PS2) 10Nuria: Removing loading of Reading_Depth into druid [puppet] - 10https://gerrit.wikimedia.org/r/531489 (https://phabricator.wikimedia.org/T229042) [13:37:24] (03CR) 10CDanis: [C: 03+1] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:37:58] (03PS2) 10Jbond: swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) [13:39:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:40:46] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:43:32] 10Operations, 10serviceops, 10PHP 7.2 support, 10PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (10MoritzMuehlenhoff) Ack, let me know when you have found a suitable value for GC_ROOT_BUFFER_MAX_ENTRIES, I have the 7.2.21 update for s... 
[13:43:50] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:44:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:44:19] (03CR) 10Jbond: [C: 03+2] swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:47:08] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:48:34] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:50:29] (03CR) 10Gehel: [C: 03+1] "@Smalyshev: do you have a list of URLs that can be used to test this change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [13:51:48] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:51:57] (03CR) 10jenkins-bot: Enable special mute on beta for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531484 (owner: 10Tchanders) [13:53:14] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:53:35] I think I haven't rebased my patch in deploy1001. Has anyone deployed anything since the SWAT? [13:54:14] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) It looks to me like all of this log output is actually from celery starting back up. I wo... [13:55:25] o/ elukey [13:55:41] How did you notice that celery was down on those nodes? [13:56:42] halfak: o/ I opened a task about it, we have an alarm on all nodes running systemd that alerts when units are failed [13:56:45] (one or more) [13:57:03] Gotcha. Thank you for looking into it. I'm writing a task now about putting monitoring in place. 
[14:00:04] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) [14:01:02] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) >>! In T230917#5428548, @Halfak wrote: > It looks to me like all of this log output is actua... [14:01:50] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10elukey) [14:03:27] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) On ores1002, I see the following in app.log: ` 2019-08-21 11:31:10,673 ERROR celery.worker.... [14:05:22] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) I see the same error on ores1006. But celery is clearly still running there. [14:05:55] elukey, just to confirm, you did not restart celery on any nodes other than ores100[2,4,5] right? 
[14:06:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:06:37] halfak: correct [14:07:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:07:27] Aha! It looks like some of the machines eventually hit a "timeout" error talking to redis -- which is recoverable. [14:07:39] ah nice! [14:07:42] where is the app.log? 
[14:07:47] I tried to find it but didn't [14:08:01] /srv/log/ores/app.log [14:08:06] ahhh [14:08:10] I always forget [14:08:27] Cc: jbond42 since he was working on it too [14:08:31] (as FYI) [14:09:42] thanks [14:10:39] !log Upgrade mysql on db2075 [14:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:17:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:19:32] ok, php7 404 performance during deploys looks terrible [14:20:13] I'm leaning towards further restricting the alerts to status <400, not sure there's anything actionable right now otherwise [14:20:29] Why would 4xx rise post-deploy? 
[14:20:34] talking about https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1566386427498&to=1566397227498&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=404&panelId=9&fullscreen [14:20:38] that I have no idea [14:20:46] (03PS2) 10Elukey: Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) [14:21:04] Krinkle: not the count of 4xx but their latency to be exact [14:22:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:22:46] (03CR) 10Elukey: [C: 03+2] Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) (owner: 10Elukey) [14:26:36] Krinkle: which browser did you say to use for mediawiki-new-errors? Chrome? 
[14:26:51] for editing that dashboard [14:27:36] !log installing ca-certificates-java update from Stretch point release [14:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=404 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:28:50] !log swap turnilo backend in varnish from analytics-tool1002 to an-tool1007 [14:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:58] new version of turnilo :) [14:29:12] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [14:29:26] (03CR) 10jerkins-bot: [V: 04-1] Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [14:29:43] !log silence average mw appserver latency alerts for 24h, too noisy [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:28] zeljkof: yeah, when I said that it only worked properly in Chrome [14:30:48] I think Firefox quantum (65 and later) can handle it as well [14:31:11] Krinkle: thanks, I was using firefox 68 but it was really slow, chrome works better [14:31:36] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) But on ores1006, the top-level error is: ` redis.exceptions.TimeoutError: Timeout reading f... 
[14:33:11] (03PS1) 10Urbanecm: Revert "Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) [14:33:29] (03PS1) 10Ayounsi: Depool codfw for routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422) [14:34:44] Krinkle: I've just made the first edit to mediawiki-new-errors, from https://logstash.wikimedia.org/goto/62dcb0a9efdf79f2d14d5bd806962174 to https://logstash.wikimedia.org/goto/e480f1755ad9183ea62c20e6ef7cd424, hopefully I didn't break anything [14:35:35] (03CR) 10Urbanecm: [C: 04-1] "Rebase, please:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [14:36:23] !log installing dns-root-data update from Stretch point release [14:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:14] 10Operations, 10MediaWiki-General: Elevated php7 latency during mw deploy - https://phabricator.wikimedia.org/T230934 (10fgiunchedi) [14:39:22] filed ^ btw, didn't seem quite normal [14:42:14] !log installing java-common update from Stretch point release [14:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:35] Urbanecm: hi, around? [14:45:42] Daimona: yes [14:46:11] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@868635a]: Upgrading superset to 0.34rc1 [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:15] Good! Would it be possible for you to grep all existing on-wiki scripts? 
[14:46:44] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@868635a]: Upgrading superset to 0.34rc1 (duration: 00m 33s) [14:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:25] Daimona: certainly [14:47:38] Cool [14:47:51] I'd like to know if any script contains: abuseFilterBoxName [14:48:25] Daimona: I guess no need to paste that as an NDA-only paste, right? [14:48:35] Indeed :) [14:48:51] It's a JS global that was removed [14:48:59] I don't think anything is using that, but better safe [14:49:05] thanks [14:49:21] Thank you :) [14:49:44] https://www.irccloud.com/pastebin/x3C2hpyC/ [14:49:53] seems you're right Daimona :)) [14:50:06] Eheh thanks :D [14:50:44] yw [14:52:56] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:53:37] (03PS2) 10Jbond: wqds: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) [14:54:21] (03CR) 10Jbond: [C: 03+2] wqds: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:59:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:00:24] (03PS2) 10Jbond: MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) [15:00:46] !log adding interface::add_ip6_mapped to media wiki servers [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:27] (03CR) 10Jbond: [C: 03+2] MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531453 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:05:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: 
remove legacy php restart script [puppet] - 10https://gerrit.wikimedia.org/r/531508 [15:05:53] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 [15:05:55] (03PS1) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 [15:06:23] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [15:07:18] !log installing python-cryptography update from Stretch point release [15:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:43] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 (owner: 10Giuseppe Lavagetto) [15:08:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: remove legacy php restart script [puppet] - 10https://gerrit.wikimedia.org/r/531508 (owner: 10Giuseppe Lavagetto) [15:08:47] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto) [15:09:26] (03PS1) 10Ayounsi: Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422) [15:09:58] (03PS2) 10Jbond: grafana: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531230 (https://phabricator.wikimedia.org/T102099) [15:11:06] (03CR) 10Jbond: [C: 03+2] grafana: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531230 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:11:49] (03PS5) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [15:12:58] (03Abandoned) 10Mholloway: 
Machine vision (beta): Configure Wikidata Beta item URL template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530575 (owner: 10Mholloway) [15:13:22] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli) [15:14:56] (03PS2) 10Jbond: debug_proxy: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531231 (https://phabricator.wikimedia.org/T102099) [15:15:02] (03CR) 10Gehel: [C: 03+1] maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:15:38] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: Rollback to 0.32 [15:15:41] (03CR) 10Jbond: [C: 03+2] debug_proxy: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531231 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:03] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: Rollback to 0.32 (duration: 00m 25s) [15:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:20] (03PS6) 10Effie Mouzeli: mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) [15:18:12] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [15:18:50] (03PS2) 10CRusnov: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) [15:20:16] (03CR) 10CRusnov: "Fixed!" 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [15:21:38] (03PS2) 10Jbond: maps - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531245 (https://phabricator.wikimedia.org/T102099) [15:21:50] (03CR) 10Effie Mouzeli: [V: 03+1] "Expected output https://puppet-compiler.wmflabs.org/compiler1002/17977/" [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli) [15:22:53] (03CR) 10Effie Mouzeli: [V: 03+1] "Expected https://puppet-compiler.wmflabs.org/compiler1002/17976/" [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli) [15:27:09] (03PS11) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) [15:28:50] Anyone to review this ^ [15:31:02] (03CR) 10Smalyshev: "https://commons.wikimedia.org/entity/statement/M40538870-D3B1F1D8-C2E4-4C7B-B562-721D6C94CF25 should be a good one." 
[puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [15:35:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] maps - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531245 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:36:56] (03PS2) 10Ayounsi: Depool codfw and eqsin for codfw routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422) [15:38:37] (03CR) 10Jbond: [C: 03+2] maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:38:47] (03PS2) 10Jbond: maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) [15:39:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki::users: Allow adding privileges to mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/531474 (owner: 10Effie Mouzeli) [15:41:24] (03PS2) 10Jbond: graphite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531236 (https://phabricator.wikimedia.org/T102099) [15:42:24] (03PS2) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 [15:42:26] (03PS2) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 [15:43:32] (03CR) 10Jbond: [C: 03+2] graphite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531236 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:43:58] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto) [15:44:37] (03PS1) 10CRusnov: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 [15:45:21] (03CR) 10Giuseppe Lavagetto: 
[C: 03+1] mediawiki::common: Allow mwdeploy user to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/531475 (https://phabricator.wikimedia.org/T224857) (owner: 10Effie Mouzeli) [15:46:15] (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [15:47:36] (03CR) 10Joal: [C: 03+1] "@nuria: The patch has been merged on June 26 :)" [puppet] - 10https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [15:48:04] (03PS2) 10Jbond: webserver_misc_apps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531239 (https://phabricator.wikimedia.org/T102099) [15:49:43] (03CR) 10Jbond: [C: 03+2] webserver_misc_apps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531239 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:53:00] (03PS2) 10Jbond: openldap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531242 (https://phabricator.wikimedia.org/T102099) [15:53:43] (03CR) 10Jbond: [C: 03+2] openldap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531242 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:56:05] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [15:56:30] (03CR) 10CRusnov: [C: 03+2] netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [15:58:07] (03CR) 10Jbond: [C: 03+2] swap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531258 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:58:35] (03PS2) 10Jbond: swap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531258 
(https://phabricator.wikimedia.org/T102099) [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:44] (03Merged) 10jenkins-bot: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:01:11] !log fixed apt config on krypton, broken getenvoy-jessie.list made apt-get update fail [16:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:53] (03CR) 10jenkins-bot: netbox: Make host private and add exception on not found [software/spicerack] - 10https://gerrit.wikimedia.org/r/531331 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:16:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:17:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:18:33] moritzm: what does r531275 do? [16:18:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/531275 [16:19:31] or jbond42 ^^^ [16:20:22] urandom: currently theses servers have a SLAAC ipv6 address which looks like ipv6 = $prefix:$mac_address. 
this change updates the server so that it will have ipv6 = $prefix:$ipv4 [16:20:41] inbound connections should be unaffected as the address is not in DNS [16:20:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:21:07] jbond42: I see [16:21:13] outbound connections would get a different ipv6 source address; however, this should not cause issues, as the SLAAC address is not configured anywhere (well, nowhere in puppet) [16:21:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:21:39] I have already deployed to a number of services and I'm pretty confident it should cause no issues [16:21:40] (03CR) 10Eevans: [C: 03+1] sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [16:21:50] thanks :) [16:26:40] apparently the 5xx spike was brief, but it seems to have affected eqsin, ulsfo, and codfw, but not eqiad or esams. 
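The two address schemes jbond42 describes at 16:20:22 ($prefix:$mac_address versus $prefix:$ipv4) can be illustrated with a short Python sketch. This is only a sketch under stated assumptions: the function names and the 2001:db8 documentation prefix are hypothetical, the SLAAC half uses the standard modified EUI-64 construction, and the mapped half assumes each decimal IPv4 octet is reused verbatim as one 16-bit group. The log does not show how the puppet change actually encodes the address.

```python
import ipaddress


def slaac_eui64(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """SLAAC-style address: /64 prefix + modified EUI-64 built from the MAC."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit and insert ff:fe in the middle.
    eui = bytes([raw[0] ^ 0x02]) + raw[1:3] + b"\xff\xfe" + raw[3:]
    return ipaddress.IPv6Address(int(net.network_address) | int.from_bytes(eui, "big"))


def mapped_ipv6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Mapped address: /64 prefix + the IPv4 octets as the low four groups.

    Assumption (not confirmed by the log): the decimal octets are written
    as-is into the hex groups, so 10.64.0.1 becomes ...:10:64:0:1.
    """
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    v4 = ipaddress.IPv4Address(ipv4)  # validates the dotted quad
    low = 0
    for octet in str(v4).split("."):
        low = (low << 16) | int(octet, 16)  # read the decimal digits as hex
    return ipaddress.IPv6Address(int(net.network_address) | low)


if __name__ == "__main__":
    # Hypothetical host: documentation prefix, made-up MAC and IPv4.
    print(slaac_eui64("2001:db8:0:1::/64", "00:11:22:33:44:55"))
    print(mapped_ipv6("2001:db8:0:1::/64", "10.64.0.1"))
```

Under this assumption the mapped address can be read straight off the host's IPv4 address, whereas the SLAAC address changes whenever the network card (and so the MAC) changes.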
[16:26:56] (eqsin and ulsfo flow through codfw, forming one side of the world so to speak) [16:28:35] (03PS1) 10Elukey: profile::superset::proxy: add X-Forwarded-Proto "http" [puppet] - 10https://gerrit.wikimedia.org/r/531526 (https://phabricator.wikimedia.org/T230416) [16:30:03] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (10Jclark-ctr) Received replacement SSD 1.9t {F30052557} [16:30:14] (03CR) 10Elukey: [C: 03+2] profile::superset::proxy: add X-Forwarded-Proto "http" [puppet] - 10https://gerrit.wikimedia.org/r/531526 (https://phabricator.wikimedia.org/T230416) (owner: 10Elukey) [16:36:02] (03CR) 10Ayounsi: [C: 03+2] Depool codfw and eqsin for codfw routers work [dns] - 10https://gerrit.wikimedia.org/r/531502 (https://phabricator.wikimedia.org/T226422) (owner: 10Ayounsi) [16:37:04] !log depool eqsin and codfw - T226422 [16:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:11] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [16:38:27] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov) [16:41:40] howdy, anyone up for deploying a small backport? [16:42:05] (03CR) 10CRusnov: [C: 03+2] netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov) [16:43:18] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (10Bstorm) Did Dell only send replacement SSD? This has lost 4 disks in a very short time (all are failed now and most missing in the list of disks). I highly suspect there is... 
[16:44:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 55.44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:45:58] that's expected ^ [16:46:02] (03Merged) 10jenkins-bot: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov) [16:46:26] !log apply BGP graceful shutdown to cr1-codfw transits - T226422 [16:46:28] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.99 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:32] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [16:47:02] (03CR) 10jenkins-bot: netbox: Add method to return host information [software/spicerack] - 10https://gerrit.wikimedia.org/r/531521 (owner: 10CRusnov) [16:48:04] Amir1: so, you're deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/531528 ? 
Asking as there is another volunteer, so we want to avoid stepping on each other's toes [16:51:12] !log increase OSPF cost on ulsfo-codfw link - T226422 [16:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:13] (03CR) 10Ayounsi: [C: 03+2] Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422) (owner: 10Ayounsi) [16:55:21] (03PS2) 10Ayounsi: Varnish: redirect eqsin/ulsfo text to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/531513 (https://phabricator.wikimedia.org/T226422) [16:56:22] !log Varnish: redirect eqsin/ulsfo text to eqiad - T226422 [16:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:28] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [16:56:46] Amir1: I will do it assuming that you're busy [16:57:01] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) @Bstorm can you try rebooting the server and see if the disks get back to the correct order. I know that works for analytics. Please tr... [16:57:44] Any objections to me SWATing in a patch, I guess I might blow the window a bit but there is still quite a while until the next calendar entry? 
[17:00:09] !log continuing the SWAT window to backport train blocker fixes [17:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:22] !log rebooting cloudvirt1024 [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:37] (03PS2) 10Jbond: sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) [17:08:47] !log disable BGP from cr1-codfw to lvs2001/2/3 - T226422 [17:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:53] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:15:07] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Yup, I can do that. I'm not sure which either, per T230442#5429068 It dropped the failures from the list, and I'm not even entirely convince... [17:15:25] tarrow: Thanks. UBNs can go in at any time [17:15:34] by definition of UBN [17:15:51] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) The disk was replaced but from what I can tell is that the raid configuration is not accepting the new disk. When I am in the raid utility... 
[17:16:45] cool [17:17:01] (03CR) 10Jbond: [C: 03+2] sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:17:04] just waiting for our good friend jenkins [17:17:11] !log failover master RE to RE1 on cr1-codfw - T226422 [17:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:16] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:17:44] !log reboot cloudvirt1024 to try and reset raid T230289 [17:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:50] T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 [17:20:27] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and logged tickets,... [17:21:26] linecards restarted as expected and coming online [17:22:24] (03PS2) 10Jbond: thumbor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) [17:23:06] (03CR) 10Jbond: [C: 03+2] thumbor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:24:19] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck... 
[17:24:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:25:03] XioNoX: is this you ^^ [17:25:16] yes [17:25:19] eqsin is depooled [17:25:26] ack ok thx [17:25:26] probably due to the transport link flap [17:25:50] !log shutdown RE0 on cr1-codfw - T226422 [17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:56] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:28:17] (03PS2) 10Jbond: etherpad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531225 (https://phabricator.wikimedia.org/T102099) [17:29:18] (03CR) 10Jbond: [C: 03+2] etherpad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531225 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:29:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:33:28] !log cloudvirt1015 down for a new motherboard [17:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:33] !log failover master RE to RE0 on cr1-codfw - T226422 [17:33:40] (03Abandoned) 10Jbond: spare::system: add ipv6 mapped addres [puppet] - 10https://gerrit.wikimedia.org/r/531157 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:42] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:34:45] Not ready for mastership switch, try after 107 secs. 
[17:34:46] haha [17:35:54] (03PS2) 10Jbond: logging::mediawiki::udp2log: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531257 (https://phabricator.wikimedia.org/T102099) [17:36:44] PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:39:10] bah, RE0 didn't pickup the config from RE1 [17:42:05] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) (owner: 10Cmjohnson) [17:42:12] solved [17:42:17] (03PS2) 10Cmjohnson: Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) [17:42:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Nuria) 05Open→03Resolved [17:42:29] warning: Chassis configuration for network services has been changed. A system reboot is mandatory. Please reboot *ALL* routing engines NOW. Continuing without a reboot might result in unexpected system behavior. 
[17:42:30] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) [17:42:36] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) (owner: 10Cmjohnson) [17:42:38] that was not in the docs [17:42:47] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns entries for logstash1000[4-6] [dns] - 10https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) (owner: 10Cmjohnson) [17:42:51] (03PS2) 10Cmjohnson: Removing mgmt dns entries for logstash1000[4-6] [dns] - 10https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) [17:42:55] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Removing mgmt dns entries for logstash1000[4-6] [dns] - 10https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) (owner: 10Cmjohnson) [17:43:02] !log restart both REs on cr1-codfw - T226422 [17:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:07] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:43:31] (03CR) 10Jbond: [C: 03+2] logging::mediawiki::udp2log: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531257 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:45:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 71.11 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:45:45] wow, so quick to reboot [17:47:22] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.34 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:48:02] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:48:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:49:52] (03CR) 10DannyS712: [C: 03+1] "Thanks for the follow up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531501 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [17:50:14] linecards take much more time to come back up [17:50:23] but they are up now [17:50:36] everything looks good, starting to repool everything [17:50:50] I mean make cr1 primary for everything [17:51:10] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:39] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10JTannerWMF) a:03Tgr [17:52:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:53:23] !log rollback: disable BGP from cr1-codfw to lvs2001/2/3 - T226422 [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:29] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [17:55:03] !log Rollback: increase OSPF cost on ulsfo-codfw link - T226422 [17:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:04] !log rollback: apply BGP graceful shutdown to cr1-codfw transits - T226422 [17:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:34] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) copied from T230442#5413070 ` Versions ================ Product Name : PERC H730P Adapter Serial No... [17:57:04] tarrow: I've got a train blocker to deploy once you're done, BTW. [17:57:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:58:37] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [17:59:21] James_F: awesome, doing this blocker first [17:59:27] just a moment :) [17:59:35] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Jclark-ctr) [17:59:43] Of course, no rush. :-) [18:00:11] We just waited 1hr on jenkins... 
:P [18:00:19] 10Operations, 10ops-eqiad, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10Jclark-ctr) [18:00:34] Yeah, it's almost like Wikidata has too many complex tests. ;-P [18:00:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:02:46] James_F: I see some unchecked-in changes in .19; am I still good to do a backport? [18:03:19] specifically to includes/resourceloader/ResourceLoaderWikiModule.php [18:03:54] not security commits or something but I guess an actual manually tweaked file [18:03:59] * James_F looks. [18:04:02] there have been chmod issues [18:04:23] which can make a file appear modified when it isn't, due to git's internal state not being reflected on disk [18:04:32] !log increase OSPF cost on cr2-codfw links - T226422 [18:04:34] ah! right [18:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:37] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [18:04:58] Krinkle: so should I be waiting until that's fixed? Or should I go ahead anyway? [18:05:01] tarrow: fixed. Yes, there are still chmod issues it seems. [18:05:19] ah, cool :) [18:08:07] James_F: at the risk of asking a stupid question it looks like somehow your backport is ahead of mine... [18:08:30] can I get you to look before I break something? [18:08:45] * James_F looks. [18:09:05] Yours is 1cc697615c3bd13bc3a46fffdb1da3e94cccc9cc, mine is "behind" that. [18:09:46] If you scap "just" the ext/Wikibase dir it'll be fine from my POV. 
[18:09:54] 10Operations, 10Elasticsearch, 10Traffic, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10debt) 05Open→03Resolved [18:09:56] ok :) [18:10:56] checking [18:11:37] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y Looks like there's a number of fixes on this update of... [18:12:15] tarrow: looks good [18:12:20] great! [18:12:25] !log deactivate transit links on cr2-codfw - T226422 [18:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:30] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [18:14:33] syncing now... [18:14:45] !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [18:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:52] !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [18:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/Wikibase/client/: SWAT: [[gerrit:531528|Use the backwards-compatible HTML ID for the wikidata item link (T66315)]] (duration: 00m 58s) [18:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:26] T66315: Move "Data item" link outside of sidebar toolbox - https://phabricator.wikimedia.org/T66315 [18:15:47] !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [18:15:50] tarrow: also looking good on non-debug host [18:15:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:53] James_F: we're done :) Thanks for the help [18:15:53] !log ayounsi@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:00] tarrow: muchos gracias [18:16:18] Always. On behalf of RelEng, thanks for fixing. :-) [18:16:57] We have one more UBN to fix but I think we will do it tomorrow morning with fresh eyes so we don't make things worse [18:17:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) Board arrived DOA...need another one [18:17:53] !log move VRRP master from cr2-codfw to cr1-codfw - T226422 [18:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:11] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [18:18:55] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.19/includes/specialpage/RedirectSpecialPage.php: T230932 RedirectSpecialArticle: Fix PHP notice about undefined index (duration: 00m 54s) [18:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:00] T230932: RedirectSpecialArticle.php: PHP Notice: Undefined index: action - https://phabricator.wikimedia.org/T230932 [18:19:26] !log shutdown re1:cr2-codfw (backup) - T226422 [18:19:28] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:39] tarrow: Is that part of T230937 or is that a different task? 
[18:21:39] T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937 [18:30:44] waiting for RE1 to be 100% online [18:31:04] jouncebot: next [18:31:05] In 1 hour(s) and 28 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2000) [18:32:23] !log failover master RE to RE1 on cr2-codfw - T226422 [18:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:29] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [18:32:46] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:33:24] RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [18:36:30] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:03] !log shutdown re0:cr2-codfw (backup) - T226422 [18:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:44:33] James_F: sorry just got back to the hotel. Yes it is a solution for T230937. Patch is up but we didn't backport it yet. 
I think I'd be happiest to have slept on it/merge it tomorrow [18:44:34] T230937: TermboxView.php: Call to a member function getSerialization() on a non-object (null) - https://phabricator.wikimedia.org/T230937 [18:48:26] 10Operations: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10sbassett) [18:49:08] 10Operations, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10Reedy) [18:49:23] 10Operations, 10netops: PCI Gap Assessment auditor question about SNMP - https://phabricator.wikimedia.org/T230952 (10Jgreen) [18:52:20] 10Operations, 10netops: PCI Gap Assessment auditor question about SNMP - https://phabricator.wikimedia.org/T230952 (10ayounsi) 05Open→03Resolved a:03ayounsi SNMPv2c Read Only. Easy task! :) [19:14:13] !log failover master RE to RE0 on cr2-codfw - T226422 [19:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:20] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [19:16:25] !log restart both REs on cr2-codfw - T226422 [19:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:54] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:19:14] waiting for linecards to bootup [19:19:20] but so far everything looks healthy [19:22:14] PROBLEM - PyBal BGP sessions are established on lvs2005 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw+prometheus/ops [19:22:34] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:38] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:22:48] alright all good [19:23:13] things are re-establishing as expected [19:23:19] devices look healthy [19:23:46] RECOVERY - PyBal BGP sessions are established on lvs2005 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw+prometheus/ops [19:24:53] !log rollback: move VRRP master from cr2-codfw to cr1-codfw - T226422 [19:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:59] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [19:25:40] !log rollback deactivate transit links on cr2-codfw - T226422 [19:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:14] !log rollback: increase OSPF cost on cr2-codfw links - T226422 [19:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:16] (PS1) Ayounsi: Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] - https://gerrit.wikimedia.org/r/531535 [19:29:29] !log ayounsi@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [19:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:40] !log ayounsi@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [19:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:17] (CR) Ayounsi: [C: +2] Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] - https://gerrit.wikimedia.org/r/531535 (owner: Ayounsi) [19:30:30] (PS2) Ayounsi: Revert "Varnish: redirect eqsin/ulsfo text to eqiad" [puppet] -
https://gerrit.wikimedia.org/r/531535 [19:30:48] (PS1) Ayounsi: Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536 [19:31:41] !log Rollback: Varnish: redirect eqsin/ulsfo text to eqiad - T226422 [19:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:47] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [19:32:22] (CR) Ayounsi: [C: +2] Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536 (owner: Ayounsi) [19:32:27] (PS2) Ayounsi: Revert "Depool codfw and eqsin for codfw routers work" [dns] - https://gerrit.wikimedia.org/r/531536 [19:34:38] !log repool codfw and eqsin - T226422 [19:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:05] Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi) Open→Resolved DONE! Everything is healthy, very little alert noise, no service impact. [19:40:02] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:51] chaomodus: ^ https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/run/ returns a server error [19:52:12] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Track timeouts, log indices used, increase thread counts [19:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:28] we replaced the routing engines in codfw and updated them in netbox as well, so I guess it's related [19:54:09] Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi) [19:54:45] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@adff5ad]: bulk_daemon: Track timeouts, log indices used, increase thread counts (duration: 02m 34s) [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:39] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@556c4d0]: bulk_daemon: Track timeouts, log indices used, increase thread counts [19:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2000).
[20:00:26] !log test l3 ECMP in ulsfo [20:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:21] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@556c4d0]: bulk_daemon: Track timeouts, log indices used, increase thread counts (duration: 04m 42s) [20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:01] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@67103e9]: bulk_daemon: Correct super() call [20:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:01] Operations, Traffic: Configure Layer3 hashing for router ECMP (for anycast DNS) - https://phabricator.wikimedia.org/T230955 (BBlack) [20:17:20] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@67103e9]: bulk_daemon: Correct super() call (duration: 04m 19s) [20:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:26] Operations, Core Platform Team, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (kchapman) [20:26:54] Operations, Core Platform Team, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (kchapman) @CCicalese_WMF could you review this from a product perspective and determine if it is something we want to do?
[20:38:56] (CR) Viztor: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor) [20:39:09] (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor) [20:42:02] (PS6) Viztor: Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 [20:42:58] (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor) [20:45:14] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Slaporte) >>! In T204056#5261399, @tramm wrote: > ... Wikimedia Eesti doesn't directly control any nameservers (however we control many DNS records of domains... [20:48:06] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@fc270fd]: bulk_daemon: Retune popularity_score bulk sizing [20:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:31] (PS7) Viztor: Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 [20:51:55] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@fc270fd]: bulk_daemon: Retune popularity_score bulk sizing (duration: 03m 49s) [20:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:08] (CR) jerkins-bot: [V: -1] Update HD logo for wikisource using default [mediawiki-config] - https://gerrit.wikimedia.org/r/529175 (owner: Viztor) [21:10:50] Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (JAufrecht) > title, subtitle and description I think what you have now, "Wikimedia Tech...
[21:30:18] Operations, Phabricator, Traffic, Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Aklapper) General reminder about [naming things](https://www.mediawiki.org/wiki/Naming_th... [21:47:18] PROBLEM - SSH on labstore1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:48:44] RECOVERY - SSH on labstore1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:00:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:01:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:38:34] (CR) DannyS712: [C: +1] "Looks good to me, pending deployment of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530014/" [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli) [22:39:30] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: MarcoAurelio) [22:40:02] Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi) Scheduled for Thursday Sept 5th, 8am PST, 11am local time, 15:00 UTC.
3h [22:52:22] Operations, Traffic, netops: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi) [22:52:25] Operations, ops-eqiad, netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi) [22:52:28] Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190821T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS.