[00:00:04] Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190612T0000) [00:13:47] RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:36:03] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:50:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:41:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 140792976 and 9 seconds [02:44:25] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 93160 and 47 seconds [03:38:27] PROBLEM - Disk space on wezen is CRITICAL: DISK CRITICAL - free space: / 1770 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [06:27:05] RECOVERY - Disk space on wezen is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [06:30:31] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/sudoers] [06:43:24] (03PS2) 10Marostegui: Revert "db1077: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/516481 [06:43:45] (03CR) 10Marostegui: Revert "db1077: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/516481 (owner: 10Marostegui) [06:44:08] (03CR) 10Marostegui: [C: 03+2] Revert "db1077: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/516481 (owner: 10Marostegui) [06:45:28] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516589 (https://phabricator.wikimedia.org/T225391) [06:47:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516589 (https://phabricator.wikimedia.org/T225391) (owner: 10Marostegui) [06:47:59] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516589 (https://phabricator.wikimedia.org/T225391) (owner: 10Marostegui) [06:48:16] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516589 (https://phabricator.wikimedia.org/T225391) (owner: 10Marostegui) [06:49:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1077 after a crash (duration: 00m 49s) [06:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:39] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:05] (03PS2) 10Petar.petkovic: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 [08:31:31] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-restart [08:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:22] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: 
CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:41:29] (03PS4) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 [08:41:40] !log pool map2003. reimage and setup is complete - T224395 [08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:45] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [08:42:36] gehel: the cirrussearch update problem alert is you? [08:43:22] apergos: yep, that's me [08:43:39] looks like we need to add a downtime in our restart cookbook [08:44:49] thanks for the confirmation! [08:45:49] apergos: thanks for the ping! [08:45:56] :-) [08:47:11] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (owner: 10CRusnov) [08:47:51] 10Operations, 10Elasticsearch, 10Icinga, 10Discovery-Search (Current work): Create Icinga check that alerts whenever elasticsearch master is down - https://phabricator.wikimedia.org/T224073 (10Mathew.onipe) [08:48:11] 10Operations, 10Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Mathew.onipe) [08:59:58] !log Gracefully stopping Zuul (kill -SIGUSR1) to prepare for the restart of the CI Jenkins T225322 [09:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:06] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.067e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:01:42] PROBLEM - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 169 gt 2 
https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [09:04:07] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10LucasWerkmeister) Perhaps you could also switch the sshd config to `AuthorizedKeysFile... [09:05:16] 10Operations, 10observability, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10faidon) p:05Normal→03High a:03fgiunchedi Right now there are 14 outstanding alerts, or about 50% of all outstanding alerts: {F29... [09:05:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10ayounsi) [09:06:01] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10ayounsi) 05Resolved→03Open > DISK WARNING - free space: / 3588 MB (8% inode=63%): https://icing... 
[09:08:32] ACKNOWLEDGEMENT - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 169.1 gt 2 Gehel cluster restart in progress https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [09:51:27] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [09:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:06] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! In T203786#5249703, @elukey wrote: > After the... [09:58:16] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516603 [09:59:15] (03PS1) 10Vgutierrez: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) [09:59:24] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516603 (owner: 10Marostegui) [10:00:18] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516603 (owner: 10Marostegui) [10:00:34] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516603 (owner: 10Marostegui) [10:01:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1077 after a crash (duration: 00m 48s) [10:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:46] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:05:28] (03CR) 10Marostegui: "I like the idea of being able to provide a given path, but I am not understanding what is the idea behind it, to be more precise this:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515064 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:14:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 17 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:17:52] !log akosiaris@deploy1001 scap-helm zotero upgrade --dry-run --debug production stable/zotero [namespace: zotero, clusters: eqiad,codfw] [10:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:58] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [10:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:05] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [10:18:05] !log akosiaris@deploy1001 scap-helm zotero finished [10:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:49] (03PS9) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [10:22:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm) [10:32:57] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516606 [10:34:02] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516606 (owner: 10Marostegui) [10:34:51] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516606 (owner: 10Marostegui) [10:36:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 after recovering from a crash (duration: 00m 47s) [10:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:39] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516606 (owner: 10Marostegui) [10:52:51] !log force-upgrade mtail to 3.0.0~rc24.1-1 on wezen - T225604 [10:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] T225604: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604 [10:55:46] PROBLEM - DPKG on wezen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:56:54] that's me ^ [11:00:47] for people who want to follow along, that's https://phabricator.wikimedia.org/T225604 [11:02:58] RECOVERY - DPKG on wezen is OK: All packages OK [11:04:32] (03PS1) 10Michael Große: Enable feature flag for breaking Wikibase API change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516608 (https://phabricator.wikimedia.org/T223303) [11:10:56] PROBLEM - Disk space on ms-be2018 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:12:00] sigh [11:12:39] I'll take care of it [11:13:50] RECOVERY - Disk space on ms-be2018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [11:33:09] thanks, was it that disk again? [11:42:41] indeed, T225613 opened for it [11:42:41] T225613: Swift / puppet interaction can fill up root filesystem - https://phabricator.wikimedia.org/T225613 [11:49:09] (03PS1) 10Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) [11:49:50] (03CR) 10jerkins-bot: [V: 04-1] swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) (owner: 10Filippo Giunchedi) [11:55:43] !log swift eqiad-prod: put back ms-be1033 - T223518 [11:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:49] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 [12:38:33] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: debian signing keyid 90E9F83F22250DD7 has expired - https://phabricator.wikimedia.org/T225624 (10Kghbln) [12:43:10] PROBLEM - SSH on ms-be1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:45:54] RECOVERY - SSH on ms-be1024 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:57:07] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: debian signing keyid 90E9F83F22250DD7 has expired - https://phabricator.wikimedia.org/T225624 (10RazeSoldier) [12:59:45] 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Legoktm) Thanks Andre, that makes it clear that this is once 
again a DMARC problem. `nasa.gov` has `p=reject` set, so providers are (theoretically) right... [13:01:01] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (10Kghbln) [13:01:46] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (10Kghbln) Added projects as done in T141400 [13:02:34] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (10Legoktm) https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org#G... [13:02:50] 10Operations, 10MediaWiki-Releasing, 10Parsoid, 10Release-Engineering-Team: debian signing keyid 90E9F83F22250DD7 has expired - https://phabricator.wikimedia.org/T225624 (10Kghbln) @RazeSoldier Thanks. Did not see that. [13:12:53] (03PS1) 10Lokal Profil: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) [13:19:58] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [13:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:46] 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10fgiunchedi) Indeed `Munge From` seems the least intrusive, AFAICT lists administrators should be able to self-set this option for the list to test it wor... 
[13:23:57] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) [13:23:59] 10Operations, 10User-Johan: 2018 data center switchover: Move all the things back to eqiad - https://phabricator.wikimedia.org/T200023 (10Johan) 05Open→03Resolved [13:34:27] (03PS1) 10Ottomata: Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 [13:35:06] (03CR) 10jerkins-bot: [V: 04-1] Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 (owner: 10Ottomata) [13:37:24] (03PS1) 10Filippo Giunchedi: icinga: increase service_check / command_timeout by 11% [puppet] - 10https://gerrit.wikimedia.org/r/516627 (https://phabricator.wikimedia.org/T210723) [13:45:10] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) @EBernhardson mentioned a feature he wanted to me yesterday: a way to delete the swift objects after they... 
[13:45:38] (03PS2) 10Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) [13:47:41] (03PS3) 10Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) [13:48:40] (03PS2) 10Ottomata: Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 [13:49:00] (03PS3) 10Ottomata: Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 [13:49:16] (03PS4) 10Ottomata: Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 [13:53:21] (03CR) 10Cwhite: icinga: Add a script to parse and query the status.dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [13:54:15] (03CR) 10Elukey: [C: 03+1] "Left a nit but looks good!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516625 (owner: 10Ottomata) [13:54:51] (03CR) 10Filippo Giunchedi: "Perhaps naive question, why hdfs as opposed to the hosts' filesystem? 
My understanding is that the swift-upload.sh script is invoked on th" [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [13:55:51] (03CR) 10Ottomata: "This could be done on the hosts filesystem, but then it would have to be deployed to all analytics workers, as we don't know which worker " [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [13:56:45] (03PS5) 10Ottomata: Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 [13:56:48] (03CR) 10Ottomata: Convert geoip::data::archive to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516625 (owner: 10Ottomata) [14:00:50] (03CR) 10Filippo Giunchedi: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [14:00:51] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/16952/stat1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/516625 (owner: 10Ottomata) [14:00:53] (03CR) 10Ottomata: [C: 03+2] Convert geoip::data::archive to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/516625 (owner: 10Ottomata) [14:02:10] (03PS1) 10Ottomata: Remove unused geoip archive cron [puppet] - 10https://gerrit.wikimedia.org/r/516629 [14:03:10] (03CR) 10Ottomata: [C: 03+2] Remove unused geoip archive cron [puppet] - 10https://gerrit.wikimedia.org/r/516629 (owner: 10Ottomata) [14:04:42] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) >>! In T210723#5252582, @faidon wrote: > Right now there are 14 outstanding alerts, or about 50% of... 
[14:08:33] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) Could you try running the upgrade twice ? But yes if the jessie image is updated then it should come with the latest rsyslog I... [14:13:41] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) We could use swift's expiring objects support, although that is something we'd have to deploy first (pu... [14:21:40] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) 05Stalled→03Resolved [14:24:25] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) [14:24:38] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) That would be perfect! [14:27:30] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) Also forcibly remove the physical disk ` array C physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 4000.7 GB, OK) => pd 1I:1:1 modify disablepd Warning: The physical drive will be di... [14:31:51] (03CR) 10Vgutierrez: "Please could we get this rebased on top of the production branch? 
Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [14:31:53] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516632 [14:33:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give more traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516632 (owner: 10Marostegui) [14:33:51] (03PS1) 10Matthias Mullie: Increase rate limits for newbies on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516633 (https://phabricator.wikimedia.org/T225148) [14:34:01] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516632 (owner: 10Marostegui) [14:34:21] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516632 (owner: 10Marostegui) [14:34:57] (03CR) 10Matthias Mullie: [C: 04-1] "-1 while this is being discussed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516633 (https://phabricator.wikimedia.org/T225148) (owner: 10Matthias Mullie) [14:35:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1077 after recovering from a crash (duration: 00m 47s) [14:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] !log Start replication on all threads on labsdb1010 - T222978 [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [14:42:35] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:37] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [14:51:02] ACKNOWLEDGEMENT - HP RAID on ms-be2018 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 2I:4:1, 2I:4:2, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T225633 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:51:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T225633 (10ops-monitoring-bot) [14:52:00] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T225633 (10fgiunchedi) [14:52:02] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) [15:01:52] (03PS2) 10Alaa Sarhan: Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) [15:08:22] PROBLEM - Device not healthy -SMART- on ms-be2018 is CRITICAL: cluster=swift device=cciss,13 instance=ms-be2018:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops [15:14:53] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-restart (exit_code=97) [15:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational [15:17:44] (03PS3) 10Gehel: A more flexible approach for mjolnir update lag [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [15:18:30] (03PS1) 10Jcrespo: labsdb: Move labsdb1010 from analytics to web to ease the extra load [puppet] - 10https://gerrit.wikimedia.org/r/516639 [15:18:39] (03CR) 10Gehel: [C: 
03+2] A more flexible approach for mjolnir update lag [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [15:21:30] (03CR) 10Gehel: [C: 03+1] "LGTM, let's see if volans has some more comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe) [15:22:51] (03PS2) 10Jcrespo: labsdb: Move labsdb1010 from analytics to web to ease the extra load [puppet] - 10https://gerrit.wikimedia.org/r/516639 [15:26:56] (03CR) 10Marostegui: labsdb: Move labsdb1010 from analytics to web to ease the extra load (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/516639 (owner: 10Jcrespo) [15:30:56] (03CR) 10Gehel: [C: 04-1] "A few minor comments inline." (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [15:37:41] !log re-enabled bawolff's gerrit account [15:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:09] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:29] (03PS1) 10Fdans: Blacklist the PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/516644 [16:00:31] (03PS5) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) [16:01:46] (03CR) 10Elukey: "Andrew: Tried to DRY up as much as possible, let me know your thoughts :)" [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [16:05:50] 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Aklapper) Thanks. 
I changed the value of `Action to take when anyone posts to the list from a domain with a DMARC Reject/Quarantine Policy` (confusingly... [16:06:22] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [16:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] (03PS2) 10Fdans: Eventlogging - Blacklist the PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/516644 [16:14:01] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10RStallman-legalteam) All have signed the NDAs now. Many thanks! [16:17:32] 10Operations, 10ops-ulsfo, 10netbox: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) 05Open→03Resolved [16:18:50] 10Operations, 10ops-ulsfo, 10netbox: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) I'm not sure who added the atlas-ulsfo serial since I commented I couldn't get it. Either I pulled it out of the rack to do it (doubtful or I'd have updated this task) or someone else pulled it som... 
[16:19:14] 10Operations, 10ops-ulsfo, 10netbox: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) a:05RobH→03None [16:25:05] (03CR) 10Ottomata: [C: 03+2] Eventlogging - Blacklist the PageIssues schema [puppet] - 10https://gerrit.wikimedia.org/r/516644 (owner: 10Fdans) [16:31:10] 10Operations, 10Performance-Team: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) p:05Triage→03Normal [16:32:25] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10Krinkle) [16:33:12] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10Krinkle) [16:44:43] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:48:56] (03PS1) 10Esanders: Turn off mobile-ab test for VE section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) [16:49:21] (03CR) 10Esanders: [C: 04-2] "Awaiting confirmation from analytics" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516659 (https://phabricator.wikimedia.org/T225645) (owner: 10Esanders) [16:50:27] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational [17:25:04] 10Operations: wikipedia.com has invalid certificate - https://phabricator.wikimedia.org/T225650 (10MichaelSchoenitzer) [17:27:38] (03PS4) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [17:53:12] 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Quiddity) Is this something that should (and can?) be set globally on all our lists? E.g. There were a bunch from AOL/Yahoo on the Xmldatadumps list. [17:58:44] the bunch from yahoo/aol might have been from the mass signup a while back of bogus accounts [18:13:15] (03PS4) 10Jforrester: Enable TimedMediaHandler's new video player Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103) [18:13:18] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/ArticlePlaceholder/includes/: T207235 / a42aa1599a131c55304 (duration: 00m 49s) [18:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:24] T207235: PHP error "Undefined index: Q…" from ArticlePlaceholder hook on SpecialSearch - https://phabricator.wikimedia.org/T207235 [18:14:33] (03CR) 10Jforrester: [C: 03+1] "Planned for deployment after the train next week (so after 2019-06-20)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103) (owner: 10Jforrester) [18:15:12] (03CR) 10Brion VIBBER: [C: 03+1] "Ready for next week's deploy :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103) (owner: 10Jforrester) [18:15:55] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.8/thumb.php: T225197 / 06b631fae5 (duration: 00m 47s) [18:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:02] T225197: "PHP Warning: Cannot modify header information - headers already sent" from /w/thumb.php - https://phabricator.wikimedia.org/T225197 [18:16:36] Krinkle: That's not really the level of UBN that's OK to deploy when there's no SREs to fix opcache corruption issues. [18:21:39] sigh [18:36:14] James_F: Those have been fixed per SRE 1+ week ago. Either way, the mediation is simple - opcache reset from CLI, which I have access to (perf-roots) and have done before. For better or worse, we do not have any other mediation. [18:36:38] fortunately, we haven't seen them in over a week, so looks like joe's fix worked. [18:36:52] No. We've not seen them for a week because we've not deployed anything for a week. [18:36:55] we are in a deployment freeze [18:37:10] A deployment freeze does not mean "except when you feel it's justified". [18:40:04] (And yes, it's incredibly disruptive to have a deployment freeze, and I've got a dozen or so things to do that I hope I'll remember by next week.) [18:51:06] I wasn't referring to the days during which we didn't deploy, obviously. The issue was fixed May 13, well over a week ago with many deploys since that each confirm it (including earlier today wrt the db failover). [18:51:42] (and as mentioned, opcache is something our team has helped with as well) [18:52:01] but yes this could've waited and perhaps the calculation/risk assessment should've swung that way. 
[18:55:45] the opcache restart script is still in progress [18:55:57] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/514673/ [18:56:12] anyways it is 10 pm here, and as the last sre standing I am off for the night [18:59:54] Thanks, yeah. I forgot about the issue where opcache can still corrupt itself - unattended - after it reaches a certain size after a certain amount of time has elapsed since the last deployment. [19:00:14] What we fixed is the issue where any deployment can cause a percentage of PHP7 servers to become corrupt requiring a manual restart. [19:00:31] (for some definition of "fixed", really something upstream needs to get a handle on, but unlikely to happen soon) [19:04:41] yep and yep [19:04:45] and now really gone :-) [19:10:40] Operations, Services, Service-deployment-requests: Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (Mholloway) [19:11:22] Operations, Services, Service-deployment-requests: Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (Mholloway) [19:16:59] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [19:43:06] Operations, Performance-Team, Availability (MediaWiki-MultiDC): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (Krinkle) [19:43:39] Operations, Performance-Team, Availability (MediaWiki-MultiDC): Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (Krinkle) >>! @elukey wrote at > whenever you have time let me k...
[19:44:11] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:58:46] (CR) Aaron Schulz: "Rebase conflict" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz) [20:12:17] (CR) Aaron Schulz: [C: +1] mcrouter: allow async foreign set/delete WAN cache operations [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz) [21:21:25] (CR) Herron: [C: +1] icinga: increase service_check / command_timeout by 11% [puppet] - https://gerrit.wikimedia.org/r/516627 (https://phabricator.wikimedia.org/T210723) (owner: Filippo Giunchedi) [21:39:26] (CR) Jforrester: "check experimental" [dumps/dcat] - https://gerrit.wikimedia.org/r/514761 (owner: L10n-bot) [21:39:44] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [dumps/dcat] - https://gerrit.wikimedia.org/r/514761 (owner: L10n-bot) [21:54:30] (CR) Jforrester: "recheck" [dumps/dcat] - https://gerrit.wikimedia.org/r/514761 (owner: L10n-bot) [22:06:29] Operations, Release-Engineering-Team: GPG Key expired apt repository - https://phabricator.wikimedia.org/T225677 (Reedy) [22:06:42] Operations, Release-Engineering-Team: GPG Key expired apt repository - https://phabricator.wikimedia.org/T225677 (Aklapper) [22:06:47] Operations, MediaWiki-Releasing, Parsoid, Release-Engineering-Team: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (Aklapper) [22:51:30] (PS3) Jcrespo: labsdb: Move labsdb1010 from analytics to web to ease the extra load [puppet] - https://gerrit.wikimedia.org/r/516639