[00:00:17] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms
[01:59:28] (PS3) Herron: lvs: add entries for logstash-next and kibana-next [puppet] - https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854)
[02:02:04] (CR) Herron: lvs: add entries for logstash-next and kibana-next (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[02:09:05] (PS2) Herron: dns: add kibana-next and logstash-next service addresses [dns] - https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854)
[02:25:21] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6208 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[02:30:10] (PS1) Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444)
[02:30:46] (CR) jerkins-bot: [V: -1] ganeti: assign ganeti400[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444) (owner: Herron)
[02:32:06] (PS2) Herron: ganeti: assign ganeti400[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/555761 (https://phabricator.wikimedia.org/T226444)
[02:36:09] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.05833 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[02:37:12] (PS1) Herron: ganeti: disable alerts on ganeti400[123] during setup [puppet] - https://gerrit.wikimedia.org/r/555763 (https://phabricator.wikimedia.org/T226444)
[02:40:46] (CR) Herron: [C: +2] ganeti: disable alerts on ganeti400[123] during setup [puppet] - https://gerrit.wikimedia.org/r/555763 (https://phabricator.wikimedia.org/T226444) (owner: Herron)
[03:52:44] !log andrew@deploy1001 Started deploy [horizon/deploy@d1cba62]: (no justification provided)
[03:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:34] !log andrew@deploy1001 Finished deploy [horizon/deploy@d1cba62]: (no justification provided) (duration: 01m 51s)
[03:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:24] !log andrew@deploy1001 Started deploy [horizon/deploy@9847a28]: (no justification provided)
[04:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:05:01] !log andrew@deploy1001 Finished deploy [horizon/deploy@9847a28]: (no justification provided) (duration: 03m 37s)
[04:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:38] (PS1) Andrew Bogott: horizon: remove PUPPET_GIT_REPO_PATH config [puppet] - https://gerrit.wikimedia.org/r/555773 (https://phabricator.wikimedia.org/T239146)
[04:18:42] (PS2) Andrew Bogott: horizon: remove PUPPET_GIT_REPO_PATH config [puppet] - https://gerrit.wikimedia.org/r/555773 (https://phabricator.wikimedia.org/T239146)
[04:22:10] (CR) Andrew Bogott: [C: +2] horizon: remove PUPPET_GIT_REPO_PATH config [puppet] - https://gerrit.wikimedia.org/r/555773 (https://phabricator.wikimedia.org/T239146) (owner: Andrew Bogott)
[06:08:06] (PS2) CRusnov: netbox: move to netbox-extras repository [puppet] - https://gerrit.wikimedia.org/r/554962
[06:12:53] (PS3) CRusnov: hieradata/netbox: Add accounting report to alerts [puppet] - https://gerrit.wikimedia.org/r/550053
[06:18:08] (Abandoned) CRusnov: netbox: Expose automated DNS repository for web access [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[06:18:33] (CR) CRusnov: "recheck" [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[06:19:31] (Abandoned) CRusnov: netbox: Make the report url in report alerts the canonical url [puppet] - https://gerrit.wikimedia.org/r/551935 (owner: CRusnov)
[06:20:12] (CR) CRusnov: "ping for review" [dns] - https://gerrit.wikimedia.org/r/541602 (https://phabricator.wikimedia.org/T234997) (owner: CRusnov)
[06:20:22] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[06:26:09] (PS2) TechneSiyam: Added 1.5x and 2x pngs of wiki project logos that are in SVG. [mediawiki-config] - https://gerrit.wikimedia.org/r/555730
[06:26:11] (PS1) TechneSiyam: Added 1.5x and 2x logos for wiki project [mediawiki-config] - https://gerrit.wikimedia.org/r/555775
[06:27:17] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[06:33:08] (PS6) CRusnov: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[06:34:54] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[06:36:16] (PS1) TechneSiyam: Modified InitialiseSettings with WikiHD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/555776
[06:40:59] Puppet, Cloud-Services, cloud-services-team (Kanban): Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (Andrew) updated grep is git grep -P '\b(? Operations, DBA: backup2001 rebooted itself - https://phabricator.wikimedia.org/T240177 (Marostegui)
[06:55:31] (PS1) Andrew Bogott: labs.yaml: amend some comments to add the project name to VM fqdns [puppet] - https://gerrit.wikimedia.org/r/555777 (https://phabricator.wikimedia.org/T153608)
[06:55:34] (PS1) Andrew Bogott: prometheus: add project to fqdn of tools-proxy-04.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/555778 (https://phabricator.wikimedia.org/T153608)
[06:59:26] (PS2) Andrew Bogott: prometheus: change default tools proxy to tools-proxy-05.tools.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/555778 (https://phabricator.wikimedia.org/T153608)
[06:59:43] (CR) Andrew Bogott: [C: +2] labs.yaml: amend some comments to add the project name to VM fqdns [puppet] - https://gerrit.wikimedia.org/r/555777 (https://phabricator.wikimedia.org/T153608) (owner: Andrew Bogott)
[07:02:09] (CR) Andrew Bogott: [C: +2] prometheus: change default tools proxy to tools-proxy-05.tools.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/555778 (https://phabricator.wikimedia.org/T153608) (owner: Andrew Bogott)
[07:03:17] Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (Andrew) Open→Resolved
[07:39:59] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops
[07:44:00] !log resetting cron on wdqs1010 to fix cronspam
[07:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:33] (PS1) KartikMistry: Update cxserver to 2019-12-05-090549-production [deployment-charts] - https://gerrit.wikimedia.org/r/555784 (https://phabricator.wikimedia.org/T217585)
[07:52:39] Planning to update cxserver in few minutes.
[07:55:09] (PS2) KartikMistry: Update cxserver to 2019-12-05-090549-production [deployment-charts] - https://gerrit.wikimedia.org/r/555784 (https://phabricator.wikimedia.org/T217585)
[07:56:04] Operations, Traffic: Monitor and plot TTFB as seen by Varnish frontends - https://phabricator.wikimedia.org/T240180 (ema)
[07:56:15] Operations, Traffic: Monitor and plot TTFB as seen by Varnish frontends - https://phabricator.wikimedia.org/T240180 (ema) p:Triage→Normal
[07:58:17] Operations, Performance-Team, Traffic: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) Analysis repeated right now capturing requests for 60s. The numbers don't look as bad. p75 TTFB in milliseconds: |**host**| **hit** | **miss** | **pass**...
[08:07:15] Operations, Data-Services, Discovery-Search, Wikidata, and 3 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (ArielGlenn) Repeating here some things from a chort chat in irc: The original limits were set because we had one host doing all of - we...
[08:08:27] (CR) ArielGlenn: "I've commented on the ticket with a bit of the backstory and a pointer to the ticket opened at the time of the move. I've no opinion on wh" [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) (owner: Bstorm)
[08:23:54] Operations, Traffic: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (ema)
[08:23:58] Operations, Traffic: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (ema) p:Triage→Normal
[08:24:54] !log cp3064: ats-tls-restart to clear "tls process restarted" alert T240183
[08:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:00] T240183: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183
[08:26:51] RECOVERY - traffic_server tls process restarted on cp3064 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3064&var-layer=tls
[08:27:02] (CR) KartikMistry: [C: +2] Update cxserver to 2019-12-05-090549-production [deployment-charts] - https://gerrit.wikimedia.org/r/555784 (https://phabricator.wikimedia.org/T217585) (owner: KartikMistry)
[08:27:16] (Merged) jenkins-bot: Update cxserver to 2019-12-05-090549-production [deployment-charts] - https://gerrit.wikimedia.org/r/555784 (https://phabricator.wikimedia.org/T217585) (owner: KartikMistry)
[08:28:28] (PS2) Muehlenhoff: Switch ldap-corp.eqiad.wikimedia.org to ldap-corp1001 [dns] - https://gerrit.wikimedia.org/r/554852 (https://phabricator.wikimedia.org/T224557)
[08:33:27] !log powercycle mw1280, mgmt console stuck, dimm errors in getsel
[08:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:27] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[08:38:03] Operations, Performance-Team, Traffic: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) We're still seeing an extra 100-150ms on the p75 TTFB reported by clients in Europe compared to before 11/11. Only 20ms of which can be attributed to TLS.
[08:39:48] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[08:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:11] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[08:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:24] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/555730 (owner: TechneSiyam)
[08:41:59] (CR) Urbanecm: [C: -1] "InitialiseSettings.php should be modified in another patch, according to https://wikitech.wikimedia.org/wiki/SWAT_deploys." [mediawiki-config] - https://gerrit.wikimedia.org/r/555730 (owner: TechneSiyam)
[08:46:02] Operations, ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (elukey)
[08:46:39] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[08:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:19] !log Updated cxserver to 2019-12-05-090549-production (T217585, T230195)
[08:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:25] T230195: MT stops translating "randomly" - https://phabricator.wikimedia.org/T230195
[08:49:25] T217585: CX2: ISBN doubled, one correctly formatted with {{ISBN}}, another incorrectly formatted with [[Special:BookSources]] - https://phabricator.wikimedia.org/T217585
[08:58:37] !log oblivian@cumin1001 conftool action : set/weight=10; selector: service=parsoid-php
[08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:57] (CR) Filippo Giunchedi: "See inline, LGTM overall but I haven't run PCC" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) (owner: Arturo Borrero Gonzalez)
[09:06:54] (CR) Filippo Giunchedi: [C: +1] "LGTM, see nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/555513 (owner: CDanis)
[09:09:21] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:09:23] (CR) Volans: [C: +1] "LGTM, compiler is happy too:" [puppet] - https://gerrit.wikimedia.org/r/554962 (owner: CRusnov)
[09:10:04] (CR) Filippo Giunchedi: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:10:29] (CR) Muehlenhoff: [C: +2] Switch ldap-corp.eqiad.wikimedia.org to ldap-corp1001 [dns] - https://gerrit.wikimedia.org/r/554852 (https://phabricator.wikimedia.org/T224557) (owner: Muehlenhoff)
[09:11:20] (CR) Volans: [C: +1] "Change LGTM but the report is not green, so I'd say we wait for it to be fixed before enabling Icinga alerts?" [puppet] - https://gerrit.wikimedia.org/r/550053 (owner: CRusnov)
[09:17:57] (PS8) Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955)
[09:17:59] (PS8) Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955)
[09:18:01] (PS5) Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955)
[09:21:29] (CR) Filippo Giunchedi: install_server: apply standard partman recipes, take #1 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:22:31] (PS9) Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955)
[09:22:33] (PS6) Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955)
[09:23:51] (CR) Filippo Giunchedi: [C: +1] traffic drop: require minimum absolute rps [puppet] - https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) (owner: CDanis)
[09:26:54] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/554656 (https://phabricator.wikimedia.org/T233448) (owner: Cwhite)
[09:33:02] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Gehel)
[09:37:58] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Mathew.onipe) I'm getting: ` CREATE FUNCTION executing SQL in: /srv/deployment/kartotherian/deploy/node_modules/@ka...
[09:53:23] (PS1) Ema: varnishmtail: add origin server logging support [puppet] - https://gerrit.wikimedia.org/r/555909 (https://phabricator.wikimedia.org/T240180)
[09:58:10] (PS4) Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269)
[09:58:47] (PS6) Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - https://gerrit.wikimedia.org/r/554528
[10:06:22] !log rolling restart php-fpm in mw-eqiad due to APCu fragmentation
[10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:18] (PS2) Ema: varnishmtail: add origin server logging support [puppet] - https://gerrit.wikimedia.org/r/555909 (https://phabricator.wikimedia.org/T240180)
[10:19:20] (CR) Filippo Giunchedi: [C: +1] varnishmtail: add origin server logging support [puppet] - https://gerrit.wikimedia.org/r/555909 (https://phabricator.wikimedia.org/T240180) (owner: Ema)
[10:22:49] (CR) Ema: [C: +2] varnishmtail: add origin server logging support [puppet] - https://gerrit.wikimedia.org/r/555909 (https://phabricator.wikimedia.org/T240180) (owner: Ema)
[10:23:01] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[10:28:46] (CR) Elukey: "For an-conf* we have the following config:" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[10:29:18] (CR) Elukey: [C: +2] statistics::discovery: move cron to timer and add kerberos support [puppet] - https://gerrit.wikimedia.org/r/554528 (owner: Elukey)
[10:41:20] (CR) Volans: "I've reviewed the diff from PS3 and tested the test instance. We're almost there, just some tweak to do in the Packages tab and one option" (30 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[10:46:10] !log T239470 addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki wikidatawiki --from-id=10000007 --to-id=10000007
[10:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:15] T239470: Check the success of the terms migration (does it have holes) - https://phabricator.wikimedia.org/T239470
[10:53:47] (CR) Filippo Giunchedi: "> Patch Set 9:" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[10:58:58] (CR) Muehlenhoff: [C: +1] "> Indeed, is it an hard requirement for zk data to be on a separate filesystem than /" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[11:12:20] (PS1) Ema: varnishmtail: update tests after varnishxcps decom [puppet] - https://gerrit.wikimedia.org/r/555915
[11:12:27] (PS10) Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955)
[11:12:56] (CR) Filippo Giunchedi: "> Patch Set 9:" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[11:14:31] Operations, observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (fgiunchedi) Open→Resolved a:fgiunchedi Done! packet drops are gone
[11:21:57] (PS2) Ema: varnishmtail: update tests after varnishxcps decom [puppet] - https://gerrit.wikimedia.org/r/555915
[11:22:58] (PS1) KartikMistry: Add 'wiki-for-human-rights' CX campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977)
[11:23:11] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/555922 (https://phabricator.wikimedia.org/T128546)
[11:25:47] (CR) Ema: [C: +2] varnishmtail: update tests after varnishxcps decom [puppet] - https://gerrit.wikimedia.org/r/555915 (owner: Ema)
[11:27:40] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:28:43] elukey: FYI ^^^
[11:29:55] (CR) Santhosh: [C: -1] Add 'wiki-for-human-rights' CX campaign (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977) (owner: KartikMistry)
[11:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T1130). Please do the needful.
[11:30:25] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/555922 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:31:18] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/555922 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:31:59] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[11:33:06] (PS2) KartikMistry: Add 'wiki-for-human-rights' CX campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977)
[11:35:16] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:555907| Bumping portals to master (T128546)]] (duration: 01m 23s)
[11:35:18] volans: ah yes thanks it is downtime expired, fixing it
[11:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:22] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:36:17] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:555907| Bumping portals to master (T128546)]] (duration: 01m 00s)
[11:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:22] (PS4) Volans: frack: fix asset tag management records [dns] - https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597)
[11:38:24] (PS3) Volans: eqsin: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597)
[11:38:26] (PS3) Volans: eqiad: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597)
[11:43:12] (CR) Santhosh: [C: +1] Add 'wiki-for-human-rights' CX campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/555921 (https://phabricator.wikimedia.org/T239977) (owner: KartikMistry)
[11:53:10] Operations, Wikimedia-Logstash, observability, Core Platform Team Legacy (Watching / External), Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782 (fgiunchedi) Open→Invalid I'm boldly declining this task for now as t...
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T1200).
[12:00:04] No GERRIT patches in the queue for this window AFAICS.
[12:01:10] o/
[12:06:53] Operations, observability: Puppet fail to properly refresh Icinga - https://phabricator.wikimedia.org/T184714 (fgiunchedi) I'm wondering if we've seen this behavior again? (i.e. certain icinga changes are not applied on puppet `refresh`)
[12:11:36] Operations, Icinga, observability, Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069 (fgiunchedi) We have availability-based alerts now (i.e. 5xx / all status codes) for varnish and ATS, those can be made paging now I believe as we...
[12:39:58] (03PS1) 10Kosta Harlan: GrowthExperiments: Configure testwiki to use local search & config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555928 (https://phabricator.wikimedia.org/T235717) [12:47:07] 10Operations, 10serviceops: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (10jijiki) [12:49:59] (03PS1) 10Ema: varnishmtail: add varnishttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/555930 (https://phabricator.wikimedia.org/T240180) [12:53:57] (03PS1) 10Kosta Harlan: GrowthExperiments: Switch beta labs wikis to use local search/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555932 (https://phabricator.wikimedia.org/T235717) [12:56:15] 10Operations, 10serviceops: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (10jijiki) a:03jijiki [13:55:49] !log reimage mw2270.codfw.wmnet mw2269.codfw.wmnet mw2268.codfw.wmnet [13:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:36] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2270.codfw.wmnet', 'mw2269.codfw.wmnet', 'mw2268.codfw.wmnet'] ` The log can be found in `/var/log/... [14:00:11] (03CR) 10Volans: "Few small things." 
(033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [14:06:13] <[1997kB]> cp5007 frontend, Varnish XID 1004621564 [14:06:13] <[1997kB]> Error: 503, Backend fetch failed at Mon, 09 Dec 2019 14:01:54 GMT [14:06:18] (03PS1) 10Ladsgroup: Disable sanity check cirrus jobs for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555941 (https://phabricator.wikimedia.org/T239931) [14:07:27] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10ema) 05Open→03Resolved The host has now been up with the new firmware with no issues for one week. Closing for now, we can re-open if needed. [14:07:30] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10ema) [14:13:37] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] (03CR) 10Volans: [C: 03+2] daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: 10Volans) [14:15:32] (03Merged) 10jenkins-bot: daemon: fix memory leak in Python 3.7+ [software/keyholder] - 10https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: 10Volans) [14:15:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:57] (03CR) 10DCausse: [C: 03+1] Disable sanity check cirrus jobs for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555941 (https://phabricator.wikimedia.org/T239931) (owner: 10Ladsgroup) [14:19:06] 10Operations, 10SRE-swift-storage, 10serviceops, 10Patch-For-Review, and 2 others: Swift object servers become briefly unresponsive on a regular basis - 
https://phabricator.wikimedia.org/T226373 (10fgiunchedi) re: ats and client timeouts and retries, yes ats does retry on origin timeout as it seems. Otherw... [14:20:45] (03CR) 10Ladsgroup: [C: 03+2] Disable sanity check cirrus jobs for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555941 (https://phabricator.wikimedia.org/T239931) (owner: 10Ladsgroup) [14:21:37] (03Merged) 10jenkins-bot: Disable sanity check cirrus jobs for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555941 (https://phabricator.wikimedia.org/T239931) (owner: 10Ladsgroup) [14:22:37] I'm deploying this ^ [14:26:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:555941|Disable sanity check cirrus jobs for Wikidata (T239931 T229407)]] (duration: 00m 57s) [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:29] T239931: Reduce the impact of the sanitizer on wikidata - https://phabricator.wikimedia.org/T239931 [14:26:30] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [14:31:19] (03PS2) 10Ema: varnishmtail: add varnishttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/555930 (https://phabricator.wikimedia.org/T240180) [14:32:59] (03CR) 10Filippo Giunchedi: [C: 03+1] varnishmtail: add varnishttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/555930 (https://phabricator.wikimedia.org/T240180) (owner: 10Ema) [14:34:21] (03CR) 10Ema: [C: 03+2] varnishmtail: add varnishttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/555930 (https://phabricator.wikimedia.org/T240180) (owner: 10Ema) [14:34:28] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, clean up debugging info (duration: 00m 59s) [14:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:36] T229407: Spikes in DB traffic and 
rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [14:34:48] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (10Volans) So far so good, leaving it open for another week or two to ensure the issue is totally fixed. [14:35:45] (03PS1) 10Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) [14:37:50] (03CR) 10Phamhi: "If this pod is meant to use as needed (whenever we need to invoke calicoctl), do we need it to be running all the time? Should we explore " [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: 10Bstorm) [14:38:18] (03PS1) 10Effie Mouzeli: mediawiki::admin Remove extra colon in fragmentation reporting [puppet] - 10https://gerrit.wikimedia.org/r/555951 [14:40:29] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::admin Remove extra colon in fragmentation reporting [puppet] - 10https://gerrit.wikimedia.org/r/555951 (owner: 10Effie Mouzeli) [14:40:47] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Finally after a long back and forth with upstream I was able to have my pull request merged. Built the new exporter an... [14:41:28] (03PS1) 10Ema: varnishmtail: install varnishttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/555953 (https://phabricator.wikimedia.org/T240180) [14:42:28] (03PS11) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [14:44:04] (03CR) 10Ema: [C: 03+2] "pcc seems satisfied, and so are we. 
https://puppet-compiler.wmflabs.org/compiler1003/19849/" [puppet] - 10https://gerrit.wikimedia.org/r/555953 (https://phabricator.wikimedia.org/T240180) (owner: 10Ema) [14:52:27] (03PS5) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [14:52:41] (03CR) 10Muehlenhoff: "Thanks! These should all be addressed now (some comments are also at the PS3)" (0315 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [14:58:59] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:14] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [15:00:27] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) 05Open→03Resolved Done! [15:01:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:10] !log upload prometheus-memcached-exporter 0.6.0+git20191209.bac8a8c-1 to buster-wikimedia [15:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:40] (03CR) 10Elukey: "Thanks a lot, I'll make sure to change partman recipe when we'll reimage zookeeper hosts. It will happen soon-ish I think since the conf20" [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:35:13] (03CR) 10Bstorm: [C: 03+1] "Non-blocking comment in-line.
LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: 10BryanDavis) [15:36:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:31] (03CR) 10Elukey: "Clarification: what I meant to say is that I am ok to move zookeeper to user /srv and avoid any special partman setting for it, it should " [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:46:32] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: 10Bstorm) [15:47:50] (03PS2) 10TechneSiyam: Added 1.5x and 2x logos for wiki project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555776 [15:47:52] (03PS1) 10TechneSiyam: HD logos for wikiprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555980 [15:47:54] (03PS1) 10TechneSiyam: modified initialise settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555981 [15:50:39] (03PS2) 10Urbanecm: HD logos for wikiprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555980 (owner: 10TechneSiyam) [15:50:52] (03PS2) 10Urbanecm: modified initialise settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555981 (owner: 10TechneSiyam) [15:51:57] (03Abandoned) 10Urbanecm: Added 1.5x and 2x logos for wiki project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555776 (owner: 10TechneSiyam) [15:57:16] !log installing openslp security 
updates [15:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:53] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2270.codfw.wmnet', 'mw2269.codfw.wmnet', 'mw2268.codfw.wmnet'] ` and were **ALL** successful. [16:01:56] (03PS2) 10Giuseppe Lavagetto: trafficserver: use https discovery url for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) [16:02:46] (03CR) 10Urbanecm: [C: 04-1] "Two minor notes:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555980 (owner: 10TechneSiyam) [16:03:00] (03CR) 10Urbanecm: [C: 03+1] "this is great, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555981 (owner: 10TechneSiyam) [16:03:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: use https discovery url for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) (owner: 10Giuseppe Lavagetto) [16:08:11] (03Abandoned) 10Urbanecm: Added 1.5x and 2x logos for wiki project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555775 (owner: 10TechneSiyam) [16:08:16] (03Abandoned) 10Urbanecm: Modified InitialiseSettings with 1.5x and 2x logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555723 (owner: 10TechneSiyam) [16:12:37] (03Abandoned) 10Urbanecm: Added 1.5x and 2x pngs of wiki project logos that are in SVG. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555649 (owner: 10TechneSiyam) [16:12:51] (03Abandoned) 10Urbanecm: Added 1.5x and 2x pngs of wiki project logos that are in SVG. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/555730 (owner: 10TechneSiyam) [16:23:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) (owner: 10Giuseppe Lavagetto) [16:23:41] what's the plan for the train this week? [16:24:02] are we going to skip wmf.8 entirely and go to wmf.10 on its regular schedule, or are we going to try to get wmf.8 out past group0 today? [16:24:27] trying to determine whether to wait for the train to roll before attempting the next round of parsoid config changes (turning linting back on) or just go ahead. [16:24:55] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10RobH) [16:25:46] (03PS8) 10CDanis: atlasexporter: generate metadata metric [puppet] - 10https://gerrit.wikimedia.org/r/555513 [16:26:34] (03PS1) 10Elukey: Add overrides to analytics1057 for raid check policy [puppet] - 10https://gerrit.wikimedia.org/r/555985 (https://phabricator.wikimedia.org/T239045) [16:27:53] (03CR) 10Elukey: [C: 03+2] Add overrides to analytics1057 for raid check policy [puppet] - 10https://gerrit.wikimedia.org/r/555985 (https://phabricator.wikimedia.org/T239045) (owner: 10Elukey) [16:30:30] (03CR) 10CDanis: [C: 03+2] "❌cdanis@evebox ~/work/gits/puppet 🕦☕ curl -s https://puppet-compiler.wmflabs.org/compiler1002/19851/netmon2001.wikimedia.org/change.netmon" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555513 (owner: 10CDanis) [16:33:17] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteThrough policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:39:19] cscott: i believe we are going to try to get wmf.8 past group0 today. [16:41:19] (answering on ticket.) 
[16:45:38] (03PS1) 10Filippo Giunchedi: monitoring: page on low HTTP global availability [puppet] - 10https://gerrit.wikimedia.org/r/555987 (https://phabricator.wikimedia.org/T186069) [16:50:26] (03CR) 10CDanis: [C: 03+2] "Did some manual inspection of this query in grafana/explore for a few sites the past six months, and looks good." [puppet] - 10https://gerrit.wikimedia.org/r/555550 (https://phabricator.wikimedia.org/T239039) (owner: 10CDanis) [16:51:42] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) Please note that I am around and able to assist for this. @Jclark-ctr: I recommend you snag 6 DAC cables, and figure out where each of these new systems is... [16:52:07] 10Operations, 10vm-requests: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete, setup of the service is tracked in T224557 [16:52:33] (03PS1) 10Muehlenhoff: Turn old LDAP replicas into spares [puppet] - 10https://gerrit.wikimedia.org/r/555990 (https://phabricator.wikimedia.org/T224557) [16:53:58] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) [16:56:11] brennen: I'm about. :-) [16:56:22] welcome to monday. :) [16:56:33] Same as the oldday? Wait, not. [16:56:37] brennen: thanks. i'll stick around in here to see the schedule y'all decide. [16:56:40] (03CR) 10Cwhite: [C: 03+2] hiera: update ores statsd exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/554656 (https://phabricator.wikimedia.org/T233448) (owner: 10Cwhite) [16:57:18] brennen: Let's JFDI? [16:57:34] James_F: that's what i was about to suggest - sooner probably better than later and all that... [16:57:40] Yeah. 
[16:58:39] we've got a parsoid deploy window today to fix some unrelated minor crashers, and turning linting back on would be a puppet patch in a SWAT window, just fyi from our side. [16:59:27] James_F: just pulling up the usual places-to-watch-for-breakage and then will run deploy-promote. [16:59:34] Cool. [17:03:21] !log attempting to roll 1.35.0-wmf.8 forward to group1 [17:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:05] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555996 [17:04:07] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555996 (owner: 10Brennen Bearnes) [17:05:04] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555996 (owner: 10Brennen Bearnes) [17:06:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/555990 (https://phabricator.wikimedia.org/T224557) (owner: 10Muehlenhoff) [17:06:42] * cscott crosses fingers [17:07:14] * James_F crosses toes, as they don't stop him from fixing prod if things break. [17:07:16] i should have caffeinated [17:07:29] live any second now... [17:07:39] brennen: Rolling the derailed train should give you enough of a jolt of adrenaline, surely? ;-) [17:08:07] i'm certainly awake [17:08:28] nothing like imminent danger to make you feel awake and alive [17:08:36] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.8 [17:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:37] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 00s) [17:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:50] Seems OK so far. [17:10:05] Quick burst of timeouts as ever but not above nominal. 
[17:10:27] (A bunch of them have moved over from wmf.5 to wmf.8 with the new load, of course.) [17:10:59] yeah looking pretttty quiet... [17:12:05] Wait a half hour for things to shake out, then bump group2? [17:12:19] James_F: works for me. [17:12:20] ooooohhh hope it takes [17:13:13] logging out and in still works on wikitech. I haven't tried account creation yet (distracted by SRE meeting) [17:13:45] andrewbogott: i can run through an account creation [17:13:51] that'd be great, thanks [17:14:12] lmk what username/shell name you use and I'll double-check that ldap gets it [17:16:12] andrewbogott: https://wikitech.wikimedia.org/wiki/User:Testy_McTesterton / testymct [17:17:15] (03PS1) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [17:17:28] brennen: ldap looks fine [17:17:36] cool. [17:17:43] I take it the creation/login/etc worked ok? [17:17:55] yep, all as-expected, got confirmation email, etc. [17:18:04] great! [17:18:31] so I think we're all fine — I have a followup config patch which I try to get in swat later in the week (if Reedy doesn't beat me to it) [17:19:05] thank you brennen [17:20:00] andrewbogott: sure thing. [17:20:51] (03PS2) 10Jhedden: aptrepo: add ceph nautilus repo for cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/556000 (https://phabricator.wikimedia.org/T239917) [17:22:12] Main logspam is wfEscapeWikiText throwing "Array to string conversion" warnings, AFAICS. [17:23:19] yeah, which has been a constant for ages. [17:23:34] Can't see a task. Will file one. [17:23:53] James_F: i think there is one, one sec while i dig [17:23:59] Oh, cool. [17:26:21] (03PS1) 10Arlolra: Make Parsoid/PHP cluster read-write to record lints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) [17:27:47] brennen: Oh, I see. Fixed in wmf.8. 
https://phabricator.wikimedia.org/T237559 :-) [17:27:48] James_F: T239239, closed as dupe of T237559; fixed in wmf.10 [17:27:49] T237559: wfEscapeWikiText() emits error "PHP Notice: Array to string conversion" on Special:Search - https://phabricator.wikimedia.org/T237559 [17:27:49] T239239: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T239239 [17:27:58] (er maybe wmf.8) [17:28:04] Hah, no, you're right, wmf.10. [17:29:54] The War on Logspam® seems to have borne serious fruit. [17:34:56] eternal vigilance is the price of, uh, log liberty. [17:35:14] Eternal logilance? [17:35:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:35:50] Hmm. [17:36:46] Peak went up at 17:32, 20 mins after the deploy. [17:36:50] Just load? [17:37:25] (03CR) 10Muehlenhoff: [C: 04-1] "Self -1, there's a bug which needs to be fixed before this is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff) [17:38:07] maybe? [17:38:34] Falling back right now. [17:39:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:42:42] feeling like that one was unrelated enough to go ahead to group2? [17:43:53] seems to be creeping back up a bit. [17:44:33] Yeah.
[17:44:50] Sorry, that's yes, let's go to group2. [17:48:46] cool, rolling. [17:49:35] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556004 [17:49:37] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556004 (owner: 10Brennen Bearnes) [17:50:44] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556004 (owner: 10Brennen Bearnes) [17:52:34] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10Halfak) [17:52:49] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.8 [17:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:11] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10Halfak) I approve from my side. Andy will need this access to be able to use Hadoop-related resources for building and analyzing machine learning models. [17:53:16] welp - piles of "ErrorException (/srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/Cite.php:638) PHP Notice: Undefined index: number" [17:53:27] (small piles) [17:56:55] Blame ze Germans! [17:58:02] why 😭 [17:58:17] i'm rolling back. this isn't getting less noisy. [17:58:24] Lucas_WMDE: WMDE has been touching that code ;) [17:58:50] Fun. [17:58:57] https://github.com/wikimedia/mediawiki-extensions-Cite/commit/a823fa23d90f66e8d9f405b4e5484ad781cd5506 [17:59:05] Adam and Thiemo [17:59:16] Of course, master of that file doesn't have 638 lines... [17:59:20] awight ^ [18:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T1800). 
[18:00:15] jouncebot: here here [18:00:21] rollback of train ongoing, fyi. [18:01:10] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.35.0-wmf.5" [18:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:41] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556005 [18:02:43] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556005 (owner: 10Brennen Bearnes) [18:03:36] (03Merged) 10jenkins-bot: Revert "all wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556005 (owner: 10Brennen Bearnes) [18:04:22] (03CR) 10Alexandros Kosiaris: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/555555 (owner: 10CDanis) [18:06:23] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@9f9190e]: New WDQS Build [18:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] mediawiki: Check APCu fragmentation in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: 10Effie Mouzeli) [18:08:32] (03PS2) 10Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) [18:08:50] (03CR) 10Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: 10Effie Mouzeli) [18:09:26] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@9f9190e]: New WDQS Build (duration: 03m 02s) [18:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:52] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@9f9190e]: New WDQS Build 
[18:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] mediawiki: Check APCu fragmentation in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: 10Effie Mouzeli) [18:13:30] (03PS3) 10Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) [18:14:46] (03CR) 10Giuseppe Lavagetto: mediawiki: Check APCu fragmentation in php-check-and-restart.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: 10Effie Mouzeli) [18:15:46] (03PS4) 10Effie Mouzeli: mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) [18:18:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Check APCu fragmentation in php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/555950 (https://phabricator.wikimedia.org/T240205) (owner: 10Effie Mouzeli) [18:18:52] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) >>! In T236327#5610797, @elukey wrote: > We can definitely reimage if it is the best path suggested by SRE, but if possible I'd do it manually (so commentin... [18:19:25] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@9f9190e]: New WDQS Build (duration: 09m 33s) [18:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:13] filed T240248, sent train blockage mail, getting a hot beverage. 
[18:21:13] T240248: "PHP Notice: Undefined index: key" and similar in Cite.php and ReferenceStack.php - https://phabricator.wikimedia.org/T240248 [18:25:44] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) a:05Cmjohnson→03Jclark-ctr [18:26:31] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10RobH) Please note I've chatted with @wiki_willy, @jclark-ctr, & @elukey about this, and I've updated all of the checklists for each server with the following: [... [18:27:55] PROBLEM - mediawiki-installation DSH group on mw2270 is CRITICAL: Host mw2270 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:28:09] ^ that is me [18:28:59] (03PS1) 10CDanis: atlasexporter: use target_site instead of site [puppet] - 10https://gerrit.wikimedia.org/r/556015 [18:30:19] (03CR) 10CDanis: [C: 03+2] atlasexporter: use target_site instead of site [puppet] - 10https://gerrit.wikimedia.org/r/556015 (owner: 10CDanis) [18:30:41] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) @jlinehan thoughts? I'm considering moving forward with intake-{analytics,logging}. [18:31:29] (03PS4) 10Herron: lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) [18:34:25] (03CR) 10Herron: [C: 03+2] lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:36:41] Reedy: Hi, I just saw the note. It's definitely caused by our recent work, I'll try to have a minimal fix for review in an hour or two. 
[18:37:03] awight: Cheers, looks like brennen filed T240248 [18:37:04] T240248: "PHP Notice: Undefined index: key" and similar in Cite.php and ReferenceStack.php - https://phabricator.wikimedia.org/T240248 [18:37:07] awight: thanks! [18:37:13] (03PS1) 10Jhedden: ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) [18:37:26] !log enabling lvs for kibana-next elk7 upgrade environment, in case any alerts fire relating to this please disregard them [18:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:34] brennen: Thanks for making the task! [18:41:39] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [18:42:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [18:46:11] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 97 connections established with conf1004.eqiad.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [18:47:29] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [18:49:09] PROBLEM - Host kibana-next.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:49:10] PROBLEM - Host kibana-next.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:49:12] (03PS1) 10Halfak: Standardizes English dictionaries on hunspell for English in ORES [puppet] - 10https://gerrit.wikimedia.org/r/556023 (https://phabricator.wikimedia.org/T239942) [18:49:32] new service, please disregard [18:50:11] ok [18:50:13] (03PS3) 10Herron: dns: add kibana-next and logstash-next
service addresses [dns] - 10https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854) [18:50:16] right [18:50:53] (03CR) 10Herron: [C: 03+2] dns: add kibana-next and logstash-next service addresses [dns] - 10https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:51:02] (03PS2) 10Halfak: Standardizes English dictionaries on hunspell for English in ORES [puppet] - 10https://gerrit.wikimedia.org/r/556023 (https://phabricator.wikimedia.org/T239942) [18:52:53] <_joe_> ok [18:54:08] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 61 connections established with conf1004.eqiad.wmnet:4001 (min=62) https://wikitech.wikimedia.org/wiki/PyBal [18:54:26] RECOVERY - Host kibana-next.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [19:00:00] RECOVERY - Host kibana-next.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [19:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T1900). [19:00:04] IAmNetx: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] I can SWAT today! 
[19:00:19] Hi, I'm here :D [19:00:22] hey [19:00:43] (03PS2) 10Urbanecm: Add aliases for Help and Project on eswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: 10IAmNetx) [19:01:03] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: 10IAmNetx) [19:01:17] (03PS7) 10Urbanecm: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:01:20] !log continue osm-import on maps1004 - T239728 [19:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:25] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:01:26] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [19:01:59] (03Merged) 10jenkins-bot: Add aliases for Help and Project on eswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: 10IAmNetx) [19:02:39] IAmNetx: please test your patch at mwdebug1001 and let me know! 
[19:03:01] will do [19:03:22] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:03:32] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [19:05:15] Looks good to me [19:05:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kibana is broken https://wikitech.wikimedia.org/wiki/Confd [19:06:15] thanks IAmNetx, syncing [19:06:16] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 51 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [19:06:16] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:07:37] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: f984d18: Add aliases for Help and Project on eswikisource (T240050) (duration: 01m 00s) [19:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:42] T240050: New namespace aliases for spanish wikisource - https://phabricator.wikimedia.org/T240050 [19:07:51] IAmNetx: done! Congratulations for your first deploy! [19:08:04] hooray! 
[19:08:10] (03PS8) 10Urbanecm: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:08:17] (03CR) 10Urbanecm: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:08:21] (03CR) 10Urbanecm: [C: 03+2] Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:09:08] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:443]) https://wikitech.wikimedia.org/wiki/PyBal [19:09:11] (03Merged) 10jenkins-bot: Upload HD logos for en, fi and nl arbcom wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [19:09:35] Zoranzoki21: can you check yours now, please? [19:09:42] Sure Urbanecm [19:10:39] Hmm where is arbcom wiki? [19:10:42] Which URL [19:11:00] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 51 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [19:11:38] arbcom-xx.wikipedia.org [19:11:44] https://arbcom-en.wikipedia.org/wiki/Main_Page [19:12:17] I can't see problem [19:13:10] great! 
[19:14:44] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 32da89f: Upload HD logos for en, fi and nl arbcom wikis (1/2, T150618) (duration: 01m 01s) [19:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:49] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [19:15:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-next-ssl on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/kibana-next-ssl https://wikitech.wikimedia.org/wiki/Confd [19:16:01] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 32da89f: Upload HD logos for en, fi and nl arbcom wikis (2/2, T150618) (duration: 01m 00s) [19:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:25] 10Operations, 10Traffic, 10observability: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 (10CDanis) 05Open→03Resolved a:03CDanis Looking at some data in grafana explore, this would have solved most cases of noise in the past fe... [19:16:31] brennen: API load has gone back to nominal, AFAICS. So good news, group1 can stay on wmf.8. [19:17:19] (03PS1) 10Herron: Revert "lvs: add entries for logstash-next and kibana-next" [puppet] - 10https://gerrit.wikimedia.org/r/556029 [19:17:21] Zoranzoki21: done [19:17:41] (03PS1) 10Herron: Revert "dns: add kibana-next and logstash-next service addresses" [dns] - 10https://gerrit.wikimedia.org/r/556033 [19:17:53] James_F: progress! 
[19:18:11] ty [19:18:12] !log Run namespaceDupes.php for eswikisource (T240050) [19:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:18] T240050: New namespace aliases for spanish wikisource - https://phabricator.wikimedia.org/T240050 [19:18:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-next-ssl on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/kibana-next-ssl https://wikitech.wikimedia.org/wiki/Confd [19:18:32] !log Purge several logo files (T150618) [19:18:33] (03CR) 10Herron: [C: 03+2] Revert "lvs: add entries for logstash-next and kibana-next" [puppet] - 10https://gerrit.wikimedia.org/r/556029 (owner: 10Herron) [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:43] !log Morning SWAT done [19:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:14] (03CR) 10Herron: [C: 03+2] Revert "dns: add kibana-next and logstash-next service addresses" [dns] - 10https://gerrit.wikimedia.org/r/556033 (owner: 10Herron) [19:19:37] (03PS1) 10EBernhardson: java::analytics: Make java 8 the default on buster [puppet] - 10https://gerrit.wikimedia.org/r/556034 (https://phabricator.wikimedia.org/T236180) [19:21:17] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 61 connections established with conf1004.eqiad.wmnet:4001 (min=61) https://wikitech.wikimedia.org/wiki/PyBal [19:21:43] PROBLEM - LVS HTTP IPv4 #page on kibana-next.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.51 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:22:07] (03PS1) 10Herron: Revert "Revert "dns: add kibana-next and logstash-next service addresses"" [dns] - 10https://gerrit.wikimedia.org/r/556035 [19:22:21] (03PS1) 10Herron: Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 [19:22:38] <_joe_> need help? 
[19:23:09] i mean, in general, always ;) [19:23:51] <_joe_> ebernhardson: ahha [19:23:51] _joe_: thx I think I spotted the problem as a mismatch between the lvs_service_ips set and the ip used by the new lvs service [19:23:58] lol [19:24:11] <_joe_> ok, I'm going back to dinner then [19:24:32] PROBLEM - Host kibana-next.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:24:32] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19853/" [puppet] - 10https://gerrit.wikimedia.org/r/556034 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [19:24:36] sounds good, sorry for the interruption [19:24:37] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:28:11] RECOVERY - mediawiki-installation DSH group on mw2270 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:28:47] (03PS1) 10Ottomata: Fix jdk-8 path for alternatives::select in profile::java::analytics [puppet] - 10https://gerrit.wikimedia.org/r/556037 (https://phabricator.wikimedia.org/T236180) [19:29:11] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix jdk-8 path for alternatives::select in profile::java::analytics [puppet] - 10https://gerrit.wikimedia.org/r/556037 (https://phabricator.wikimedia.org/T236180) (owner: 10Ottomata) [19:32:40] (03PS2) 10Herron: Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 [19:32:45] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:35:25] (03PS2) 10Jhedden: ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) [19:35:39] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:35:40] 
PROBLEM - Host kibana-next.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:35:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana-next-ssl on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/kibana-next-ssl https://wikitech.wikimedia.org/wiki/Confd [19:36:39] we're still to ignore the kibana pages, yes? [19:37:03] yes [19:37:05] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 51 connections established with conf2001.codfw.wmnet:2379 (min=51) https://wikitech.wikimedia.org/wiki/PyBal [19:37:15] icinga taking some time to update [19:38:01] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 51 connections established with conf2001.codfw.wmnet:2379 (min=51) https://wikitech.wikimedia.org/wiki/PyBal [19:40:27] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kibana-next-ssl on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/kibana-next-ssl https://wikitech.wikimedia.org/wiki/Confd [19:41:38] ok! [19:45:15] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 97 connections established with conf1004.eqiad.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [19:46:23] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:48:46] brennen: James_F thanks for all the train work and communication you're both doing today. 
[19:53:32] thcipriani: any old time [19:55:22] :) [19:59:50] (03PS3) 10Herron: Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 [20:03:29] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [20:12:00] (03CR) 10Phamhi: [C: 03+1] toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: 10Bstorm) [20:15:27] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) Hey also, before I go through with this; is there any issue with CORS here? If we go with a separate (non wikimed... [20:17:01] (03PS4) 10Herron: Revert "Revert "lvs: add entries for logstash-next and kibana-next"" [puppet] - 10https://gerrit.wikimedia.org/r/556036 [20:38:27] 10Operations, 10ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10wiki_willy) a:03Jclark-ctr Looks like it's out of warranty. We can purchase a replacement DIMM, but will need the correct specs to place the order. Thanks Willy [20:46:05] (03PS1) 10Cwhite: hiera: update ores to pass statsd through the statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/556052 (https://phabricator.wikimedia.org/T205870) [20:54:19] 10Puppet, 10Beta-Cluster-Infrastructure: Add memcached to mwmaint01 using puppet - https://phabricator.wikimedia.org/T240263 (10Ladsgroup) [21:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T2100). [21:24:41] (03CR) 10Bstorm: "If I'm just reviewing too early, lemme know. :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [21:27:55] (03CR) 10Jhedden: "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [21:29:07] (03PS3) 10Jhedden: ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) [21:29:36] (03CR) 10Bstorm: [C: 03+1] "Overall, looks really good, but maybe add # TODO comments to address the inline issues (or however)." [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [21:30:21] (03CR) 10Jhedden: "Removed the OSD and client files. These manifests will be re-added once they're ready." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [21:31:44] (03CR) 10Bstorm: [C: 03+1] ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [21:32:08] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@08cfd70]: Set location of ivy cache for spark [21:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:32] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@08cfd70]: Set location of ivy cache for spark (duration: 00m 24s) [21:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:04] (03PS4) 10Jhedden: ceph: initial monitor and mgr daemon modules [puppet] - 10https://gerrit.wikimedia.org/r/556017 (https://phabricator.wikimedia.org/T239918) [21:45:01] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) 05Resolved→03Open [21:45:16] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@aa65057]: Update mobileapps to f9771ab [21:45:18] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) needs to be done in codfw as well [21:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:52] !log restart prometheus on prometheus2003 -- T238807 [21:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:57] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [21:55:54] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@aa65057]: Update mobileapps to f9771ab (duration: 10m 39s) [21:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security 
deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191209T2200). [22:05:39] (03CR) 10Arlolra: [C: 04-1] "We probably want to block on this now," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [22:22:15] (03CR) 10Subramanya Sastry: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [22:26:50] 10Operations, 10Traffic: Clean up DNS server puppetization - https://phabricator.wikimedia.org/T240285 (10BBlack) [22:27:21] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [22:27:24] 10Operations, 10Traffic: Clean up DNS server puppetization - https://phabricator.wikimedia.org/T240285 (10BBlack) [22:28:01] brennen: I'll wait here for your signal to test mwdebug1002. [22:32:10] If that looks okay, maybe we should do a few minutes of canary traffic, to get a bigger sample? Sorry I'm unfamiliar with how the train is pushed in cases like this. [22:32:49] Deploying sec patch for T192134 soon... [22:33:22] awight: just ran `scap pull` on mwdebug1002 [22:34:01] It'll only get your manual traffic though. [22:34:20] Theoretically we could manually pull onto a real production server, but that's not standard. [22:37:07] !log Deployed security patch for T192134 to wmf.5 [22:37:08] ack [22:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:31] James_F / awight: if last time is any indication, we'd know whether the fix has taken very quickly on rolling forward... 
[22:40:27] !log Deployed security patch for T192134 to wmf.8 [22:40:29] :D it seems to be a popular extension [22:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:00] brennen: I'm not sure I'm checking the right thing here: reading logstash for mwdebug1002, but I don't know whether that would show the same level of logging as logspam-watch. [22:41:16] I don't see the latter script on the debug server, unfortunately. [22:41:49] awight: logstash should cover it, i think. (logspam-watch is just on mwlog boxen.) [22:42:02] k thanks! [22:43:44] Do you happen to know how long the latency is between mwdebug and logstash? [22:44:00] A few seconds at most. [22:44:08] ty [22:44:29] Did we see the errors on non-group2 wikis/ [22:44:59] I've tried three URLs which seemed to be linked to the most stack traces (although never able to reproduce it locally), and none of the Cite array errors are appearing in logstash. [22:45:16] James_F: Good question, let me look now. [22:45:24] On wmf.8 wikis? [22:46:03] yeah, definitely a good question. i'm pretty sure i haven't seen anything since rolling back to group 1. [22:46:14] (03PS1) 10EBernhardson: Deploy analytics-search keytab to an-airflow [puppet] - 10https://gerrit.wikimedia.org/r/556072 [22:46:31] Might well be a codepath only used on mega-Wikipedias, for instance. [22:46:34] (03CR) 10jerkins-bot: [V: 04-1] Deploy analytics-search keytab to an-airflow [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [22:47:08] * awight hulks up shoulders to reproduce error more convincingly [22:47:28] We should roll the train and find out. ;-) [22:47:42] I'm ready [22:47:49] brennen's call. [22:48:02] (03CR) 10EBernhardson: "I'm not sure if this is right, but what i'm trying to do is allow the `airflow` user to submit jobs to hadoop with the `analytics-search` " [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [22:48:20] yep, let's do it. 
we're skating a bit close to the 15:00 pacific cutoff, so i'd say this is the last try for the day. [22:48:33] Yeah. [22:48:36] sbassett: prod clear on security patches? [22:48:53] brennen: yes, I'm done. [22:49:02] thanks. going ahead. [22:50:23] James_F: erm, i have the hotpatch checked out on extensions/Cite and i've got: [22:50:24] modified: extensions/Cite [22:50:34] * James_F has a look. [22:50:37] in git status for php-1.35.0-wmf.8 - this is sane, yes? [22:51:41] Yeah, LGTM, but I'll tweak. [22:52:23] brennen: Go for it. [22:52:47] goin' [22:53:09] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556078 [22:53:11] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556078 (owner: 10Brennen Bearnes) [22:53:52] James_F: I found an instance of one error on testwiki, so feeling more confident that "bad code" is the culprit :-) But everything else was from group2 wikis. [22:54:03] Right. [22:54:10] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556078 (owner: 10Brennen Bearnes) [22:54:40] !log restart prometheus on prometheus2004 -- T238807 [22:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:45] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [22:54:49] (03CR) 10Jhedden: [C: 03+1] toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - 10https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) (owner: 10Bstorm) [22:55:29] Wait. [22:55:40] brennen, awight: We didn't actually sync the Cite change out… [22:55:53] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.8 [22:55:53] Whew! 
I was about to shout "nooooo" down the well [22:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:26] I mean, it'll give us a natural experiment on it working, but… whoops. [22:56:38] Haha, it "works" alright [22:56:59] welp. [22:57:02] reverting. [22:57:13] brennen: Just sync-dir extensions/Cite. [22:57:24] James_F: whoops, kk. [22:57:36] with the php-version prefix on the front ;) [22:57:46] That too. [22:57:49] James_F: I might have misunderstood you. The wmf.8 code that was just deployed was the unpatched or patched code? [22:58:19] awight: Except for mwdebug1001 where you manually pulled it to the patched code, everywhere else was still running the unpatched code. [22:58:26] kk [22:58:58] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [23:00:17] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Cite: Sync [[gerrit:556066|Hotfix: Defensive array accesses (T240248)]] (duration: 00m 57s) [23:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:22] T240248: "PHP Notice: Undefined index: key" and similar in Cite.php and ReferenceStack.php - https://phabricator.wikimedia.org/T240248 [23:00:33] brennen: Fancy log message and all. [23:00:55] So far, all looks quiet. [23:01:25] brennen: Congratulations, train done? Now it's my turn to complain to awight, this time about tomorrow's branch. ;-) [23:01:45] :D James_F thanks, I was hoping you might have a minute to do that! [23:02:22] Do you have thoughts about how to deploy this much code churn, safely? 
[23:02:32] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [23:03:03] The code churn theoretically is "fine", it's just testing coverage that's so low we can't be sure what'll happen. [23:03:06] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [23:03:23] (03PS2) 10EBernhardson: Deploy analytics-search keytab to an-airflow [puppet] - 10https://gerrit.wikimedia.org/r/556072 [23:03:34] ...i hate to say this, but i'm not entirely sure the undefined index warnings have... [23:03:44] (03CR) 10jerkins-bot: [V: 04-1] Deploy analytics-search keytab to an-airflow [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [23:03:58] yeah: ErrorException (/srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/ReferenceStack.php:186) PHP Notice: Undefined index: BioPrimeiro3 [23:04:12] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [23:04:26] Hmm, yes. [23:04:44] Is it a mis-behaving template? 
[23:04:51] (03CR) 10EBernhardson: Deploy analytics-search keytab to an-airflow (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556072 (owner: 10EBernhardson) [23:04:55] brennen: There could be a long tail on these, since large articles can take a minute or more to save. but... yeah it seems like it should have settled. [23:05:19] James_F: The extension needs to be immune to a nearly-infinite number of misbehaving templates, I'm afraid. [23:05:26] Plausibly all from a single page save, indeed. [23:08:43] they're definitely continuing. i'm also seeing "Undefined index: " - empty string i'm guessing? which may be a new variation. [23:08:52] Yeah… [23:08:57] I'll try to quickly write a hotfix for the last two lines; let me know if we can make the cutoff, though. [23:09:15] The cut-off is arbitrary. I'd rather we fix-in-place if possible. [23:09:22] ^ yeah. [23:09:28] The 500s level isn't rising, so readers are doing OK. [23:09:53] cool, almost done with a patch. [23:09:57] and errors in general are (::knocking wood::) pretty quiet; i think if we can just get this nailed down we'll be in a good place for tomorrow. [23:10:23] 🤞🏽 [23:10:59] hehe James_F hit an array access error [23:11:23] I did? [23:14:11] " " was a pregnant pause [23:14:37] terminal emoji support issues? :) [23:14:43] Fire retardant foam continues here, https://gerrit.wikimedia.org/r/556080 [23:15:24] awight: Doesn't that still need the ++? [23:15:30] aah that's a sad face, irssi usually has emojisen for me [23:16:24] James_F: line 200 is the reassignment [23:16:38] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [23:16:47] Did I mention that none of this should be possible, if PHP were to behave itself? [23:17:17] Oh, right, the increment happens on line 198. [23:17:28] C+2'ed. [23:18:31] Just to share how crazy some of these errors are, lines 115-116 assign to that sequence in the head of the function. [23:19:30] * James_F sighs. [23:19:34] PHP sucks. [23:20:09] at least a few things in life are constant. [23:20:16] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [23:20:17] True. [23:20:40] (03PS4) 10BBlack: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [23:23:32] cscott: To confirm, Parsoid/PHP isn't running "real" language variant requests in production, just testing/mirroring? "/srv/deployment/parsoid/…/Assert.php:217) Invariant failed: /srv/deployment/parsoid/…/langconv/…/fst/brack-zh-hans-noop.pfst" is the number one (and four) error in production right now. [23:25:10] (03CR) 10BBlack: [C: 03+2] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [23:26:12] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [23:27:34] brennen: Want me to deploy? [23:27:43] go to town. :) [23:28:50] Live on mwdebug1001. [23:29:14] "testing" [23:31:28] James_F: There's an error. FYI the winning URL for this error is https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=text|displaytitle&section=0&page=Rohingya_genocide&origin=* [23:32:04] awight: I get the same result on mwdebug1001 and without… [23:32:21] awight: Does it fix it for you? [23:32:32] (03CR) 10CRusnov: [C: 03+1] "LGTM optional nit inline (non-blocker)" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [23:33:02] James_F: Well, I see errors at 23:29:58, so that means the fix didn't work, AFAICT. [23:33:14] awight: It's not deployed yet. [23:33:29] This is an error on mwdebug1001 [23:33:39] Ah, then in that case. [23:33:45] Thoughts? [23:34:12] One line before the error, we assign to the supposedly missing array key. [23:34:41] Is there a magic invisible UTF character in the code? [23:35:07] Maybe the operator precedence of ?? is weaker than = [23:35:08] (03PS1) 10BBlack: authdns: remove host:53 monitor listener [puppet] - 10https://gerrit.wikimedia.org/r/556081 (https://phabricator.wikimedia.org/T240285) [23:35:10] (03PS1) 10BBlack: dnsbox: move profile::standard up to role [puppet] - 10https://gerrit.wikimedia.org/r/556082 (https://phabricator.wikimedia.org/T240285) [23:35:37] not the case.
[23:35:45] awight: Not according to https://www.php.net/manual/en/language.operators.precedence.php [23:35:50] +1 [23:37:07] (03CR) 10BBlack: [C: 03+2] authdns: remove host:53 monitor listener [puppet] - 10https://gerrit.wikimedia.org/r/556081 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:37:30] (03CR) 10BBlack: [C: 03+2] dnsbox: move profile::standard up to role [puppet] - 10https://gerrit.wikimedia.org/r/556082 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:38:15] awight: I just loaded https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=text|displaytitle&section=0&page=Rohingya_genocide&origin=* four times from mwdebug1001 and nothing in the error log? [23:39:04] Lots of errors! [23:40:30] That line should be guarded by both isset and ??, I don't know what else to throw at it... [23:40:43] other than some ECC memory [23:41:07] Might it be parser cache issues? [23:41:15] (03PS1) 10BBlack: dnsbox: remove include_auth switch [puppet] - 10https://gerrit.wikimedia.org/r/556083 (https://phabricator.wikimedia.org/T240285) [23:42:08] I don't believe any of Cite's state is kept in ParserCache. It's a monstrous, ad-hoc property $parser->extCite, but shouldn't be cached. [23:42:36] So my understanding is that we only ever share these internal values among one running instance of the server. [23:42:38] extCite isn't in the cache? [23:42:50] It shouldn't be, it's only in the Parser object. [23:42:58] Maybe it's being shared between different threads of PHP somehow? [23:43:09] *that* is possible [23:43:42] * brennen raises eyebrow [23:43:43] All the ""ReferenceStack.php: Return value of Cite\ReferenceStack::getGroupRefs() must be of the type array, null returned errors are from action=submits, so plausibly they're user error (but happening a lot). [23:43:46] It's also the only thing that begins to explain how evil...
[23:44:00] fwiw, our servers do have ECC memory :) [23:44:08] brennen: Sounds like the kind of "efficient" optimisation someone might make. [23:44:26] bblack: And we wouldn't have cascading failures across the fleet but only in one set of code even so. [23:44:37] bblack: Now I need a fresh scapegoat! [23:45:24] (03CR) 10BBlack: [C: 03+2] dnsbox: remove include_auth switch [puppet] - 10https://gerrit.wikimedia.org/r/556083 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:46:20] sentient cosmic rays which know how to create synchronized memory errors across the fleet which evade ECC's simplistic algorithms? :) [23:47:12] awight: Aha! [23:47:26] awight: On https://en.wikipedia.org/wiki/Rohingya_genocide?veaction=editsource the "auto" reference is defined multiple times. [23:47:30] yes. [23:47:36] awight: Is that true for all the failing values? [23:48:48] James_F: I don't have enough example articles to say whether that's the case, but what if it is? [23:49:25] awight: Makes finding the breakage a lot simpler, I'd have though. 
[23:49:27] +t [23:49:30] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [23:49:36] (03PS1) 10BBlack: dnsrecursor: rename ulimits to override [puppet] - 10https://gerrit.wikimedia.org/r/556084 (https://phabricator.wikimedia.org/T240285) [23:49:38] (03PS1) 10BBlack: dnsrecursor: add parameter bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556085 (https://phabricator.wikimedia.org/T240285) [23:49:56] (03PS1) 10BBlack: dnsbox: replace glue with bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556086 (https://phabricator.wikimedia.org/T240285) [23:50:26] (03PS1) 10BBlack: dnsbox: eliminate extra profile layer [puppet] - 10https://gerrit.wikimedia.org/r/556087 (https://phabricator.wikimedia.org/T240285) [23:50:39] inoright! But all it really means is that this is the only way to reach the code path we're looking at. [23:50:47] (03CR) 10Cwhite: [C: 03+2] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/556052 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:51:49] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: add parameter bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556085 (https://phabricator.wikimedia.org/T240285) (owner: 10BBlack) [23:52:53] bblack: dnsbox: remove include_auth switch (34363638fe) good to merge? [23:53:20] OK, so, options: [23:53:23] shdubsh: yes [23:53:31] ack [23:53:31] 1) Do nothing, leave wmf.8 very noisy, hope that wmf.10 fixes it. [23:53:43] 2) Revert Cite in wmf.8 back to wmf.5. [23:54:06] 3) Revert the whole train to wmf.5 for group2 and then… jump to wmf.10 on Thursday? Eesh. 
[23:54:18] (03PS2) 10BBlack: dnsrecursor: add parameter bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556085 (https://phabricator.wikimedia.org/T240285) [23:54:20] (03PS2) 10BBlack: dnsbox: replace glue with bind_service [puppet] - 10https://gerrit.wikimedia.org/r/556086 (https://phabricator.wikimedia.org/T240285) [23:54:21] awight: What are you thinking? [23:54:22] (03PS2) 10BBlack: dnsbox: eliminate extra profile layer [puppet] - 10https://gerrit.wikimedia.org/r/556087 (https://phabricator.wikimedia.org/T240285) [23:54:49] I'm fine with (2). I can also offer another cheap patch to just... try to find the invisible levers here. [23:54:59] The level of logspam isn't good, but it's only back to where we were a year ago. AFAICS it's not interfering with user activity. [23:55:11] But I understand if another patch is not [23:55:13] I'm quite worried about wmf.10 given this, as so much code has changed (again). [23:55:27] #3 sounds like a high risk for wmf.10 [23:55:29] *I understand if a third hotfix is not an option, cos ridiculous. [23:55:39] Throwing endless patches at the problem without an RCA seems fruitless. [23:55:44] brennen: Yeah, that's my concern. [23:56:11] I'm minded to go with (1) for now. [23:56:13] +1 whatever an RCA is, I agree [23:56:21] Sorry, Root Cause Analysis. [23:56:28] I'm of course fine with (1) and (2) [23:56:30] Patching the symptoms rather than the cause. [23:56:31] i don't think this is an acceptable level of logspam to persist for very long, but #1 still seems like the lowest risk option at the moment. [23:56:39] Yeah. :-( [23:57:01] is the logspam going to crater logstash or something, or just human-level spam? [23:57:11] human level. [23:57:46] it's ignorable enough on that level, but this is definitely a setback for The War on Logspam. [23:58:48] lol sign me up for the next crusade. I can make it our team's highest priority to fix this week, as well as track the wmf.10 deployment. [23:59:10] Battle of the Bulge (of Errors). 
[23:59:14] if only the wars on drugs and terrorism hadn't depleted us of the heroin-fueled crusaders we needed to win the War on Logspam :/ [23:59:22] such are unintended consequences sometimes! [23:59:45] Send the children 8D