[00:08:06] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-kafka-jumbo-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3881725 (Krenair)
[00:27:23] Operations, Puppet, Beta-Cluster-Infrastructure, media-storage: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3881727 (Krenair) Looks like the reason is we have an old broken version of https://gerrit.wikimedia.org/r/#/c/3...
[00:43:55] (CR) Alex Monk: swift: use implicit /dev/swift prefix for swift devices (2 comments) [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi)
[00:44:03] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi)
[00:44:25] (CR) jerkins-bot: [V: -1] swift: use implicit /dev/swift prefix for swift devices [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi)
[00:50:43] (PS9) Alex Monk: swift: use implicit /dev/swift prefix for swift devices [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi)
[00:55:29] (PS1) Alex Monk: swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236)
[00:57:53] Operations, Puppet, Beta-Cluster-Infrastructure, media-storage, Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3881742 (Krenair) a:Krenair Found a syntax problem in the latest version of it too (je...
[01:50:21] PROBLEM - very high load average likely xfs on ms-be2037 is CRITICAL: CRITICAL - load average: 226.35, 101.26, 51.77
[01:55:42] PROBLEM - MD RAID on ms-be2037 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[01:55:43] ACKNOWLEDGEMENT - MD RAID on ms-be2037 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184390
[01:55:47] Operations, ops-codfw: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#3881751 (ops-monitoring-bot)
[01:57:31] PROBLEM - Disk space on ms-be2037 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb4 is not accessible: Input/output error
[02:02:31] PROBLEM - Host ms-be2037 is DOWN: PING CRITICAL - Packet loss = 100%
[02:33:43] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 06m 17s)
[02:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 736.11 seconds
[03:55:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 206.37 seconds
[04:19:36] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:20:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:31:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:31:36] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:09:59] (PS7) Albert221: Remove language button from Wikidata and MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665)
[06:11:56] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:12:37] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:17:58] !log Deploy schema change on s7 primary master (db1062) - T174569
[06:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:09] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:19:31] (PS1) Gergő Tisza: Configure the Swift file backend to accept X-MediaWiki- headers. [mediawiki-config] - https://gerrit.wikimedia.org/r/402763
[06:27:46] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:47] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.017 second response time
[06:32:26] !log Disable auto-learn on db1011
[06:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:26] RECOVERY - MegaRAID on db1011 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[06:41:00] (PS1) Marostegui: db-eqiad.php: Depool db1067, db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/402764 (https://phabricator.wikimedia.org/T162807)
[06:43:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067, db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/402764 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:44:32] (Merged) jenkins-bot: db-eqiad.php: Depool db1067, db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/402764 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:45:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 and db1089 - T162807 (duration: 00m 51s)
[06:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:09] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:46:43] (CR) jenkins-bot: db-eqiad.php: Depool db1067, db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/402764 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[06:51:07] !log Stop replication in sync on db1067 and db1089 - T162807
[06:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:17] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:56:30] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1039 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/402765 (https://phabricator.wikimedia.org/T184262)
[07:00:56] (PS1) Marostegui: mariadb: Remove db1039 [puppet] - https://gerrit.wikimedia.org/r/402766 (https://phabricator.wikimedia.org/T184262)
[07:01:00] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1039 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/402765 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:03:06] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1039 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/402765 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:03:20] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1039 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/402765 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:04:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1039 as it will be decommissioned - T184262 (duration: 00m 50s)
[07:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:44] T184262: Decommission db1039 - https://phabricator.wikimedia.org/T184262
[07:05:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1039 as it will be decommissioned - T184262 (duration: 00m 50s)
[07:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:11] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9609/" [puppet] - https://gerrit.wikimedia.org/r/402766 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:09:22] (PS2) Marostegui: mariadb: Remove db1039 [puppet] - https://gerrit.wikimedia.org/r/402766 (https://phabricator.wikimedia.org/T184262)
[07:13:04] (CR) Marostegui: [C: 2] mariadb: Remove db1039 [puppet] - https://gerrit.wikimedia.org/r/402766 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:17:10] !log Remove db1039 from tendril - T184262
[07:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:21] T184262: Decommission db1039 - https://phabricator.wikimedia.org/T184262
[07:24:13] !log Stop MySQL on db1039 for decommission - T184262
[07:24:15] (PS1) Marostegui: s7.hosts: Remove db1039 [software] - https://gerrit.wikimedia.org/r/402768 (https://phabricator.wikimedia.org/T184262)
[07:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:23] T184262: Decommission db1039 - https://phabricator.wikimedia.org/T184262
[07:25:41] (CR) Marostegui: [C: 2] s7.hosts: Remove db1039 [software] - https://gerrit.wikimedia.org/r/402768 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:26:22] (Merged) jenkins-bot: s7.hosts: Remove db1039 [software] - https://gerrit.wikimedia.org/r/402768 (https://phabricator.wikimedia.org/T184262) (owner: Marostegui)
[07:27:08] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#3881876 (Marostegui) a:Marostegui>Cmjohnson db1039 is now ready to be decommissioned by @Cmjohnson
[07:41:14] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[07:43:15] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:45:04] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[07:53:50] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3881906 (MarcoAurelio)
[08:05:55] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3881949 (Marostegui)
[08:10:08] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#3881976 (Marostegui)
[08:27:24] (PS1) MarcoAurelio: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314)
[08:30:24] PROBLEM - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[08:30:47] !log installing remaining openssl updates
[08:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:12] Operations, Developer-Relations (Jan-Mar-2018), cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3882018 (Qgil) If replying via email is a wanted feature, then it should be discussed in a separate task blocking {T180853}...
[08:43:57] (PS2) Urbanecm: Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374)
[08:44:00] (PS5) Jcrespo: Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[08:44:52] (PS3) Urbanecm: Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374)
[08:45:24] (CR) Jcrespo: [C: 2] Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[08:45:45] (PS4) Urbanecm: Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374)
[08:47:35] (PS2) Elukey: profile::analytics::database::meta::backup_dest: allow labs dir perms [puppet] - https://gerrit.wikimedia.org/r/402382 (https://phabricator.wikimedia.org/T166248)
[08:47:50] (PS5) Urbanecm: Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374)
[08:48:04] Operations, Commons, Multimedia, Traffic, and 4 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3337532 (Gilles) Given that the vast majority of the abuse was with video files, transcode support seems like a must have. As it stands...
[08:49:13] (CR) Jcrespo: "Please check if the following is correct:" [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[08:50:18] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9610/" [puppet] - https://gerrit.wikimedia.org/r/402382 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[09:13:07] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3882064 (hashar) Seems like deployment-netbox fails to setup the LetsEncrypt certificate because it is coded to use the production URL (netbox.wikimedia...
[09:13:25] PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[09:13:38] (PS1) Elukey: profile::hadoop::backup::namenode: improve labs support [puppet] - https://gerrit.wikimedia.org/r/402783 (https://phabricator.wikimedia.org/T166248)
[09:13:52] Operations, ops-eqiad, DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3882079 (Marostegui) `˜/icinga-wm 10:13> PROBLEM - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough`
[09:14:00] !log Force BBU relearn on db1059 - T184160
[09:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:12] T184160: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160
[09:17:21] <_joe_> !log starting 3 manual loops for consuming refreshLinks jobs for ruwiki
[09:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:09] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9611/" [puppet] - https://gerrit.wikimedia.org/r/402783 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[09:19:18] Operations, Continuous-Integration-Infrastructure, Traffic: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3877424 (ema) Yes Apache should send the `Cache-Control` header for that purpose. Eg: `Cache-control: s-maxage=3600, must-revalidate, max-age=0`
[09:19:20] (PS2) Giuseppe Lavagetto: graphite: reorganize roles, one role() call per node [puppet] - https://gerrit.wikimedia.org/r/402388
[09:19:22] (PS2) Giuseppe Lavagetto: role::installserver: create meta-role for installserver [puppet] - https://gerrit.wikimedia.org/r/402389
[09:19:24] (PS2) Giuseppe Lavagetto: site.pp: one role() call for iron.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/402390
[09:19:26] (PS1) Giuseppe Lavagetto: kafka: create compound roles, one role() call per node definition [puppet] - https://gerrit.wikimedia.org/r/402784
[09:19:28] (PS1) Giuseppe Lavagetto: kripton: one role() call [puppet] - https://gerrit.wikimedia.org/r/402785
[09:19:32] Operations, Continuous-Integration-Infrastructure, Traffic: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3882097 (ema) p:Triage>Normal
[09:19:33] (PS1) Giuseppe Lavagetto: site.pp: reorganize labs host to use one role() call per node [puppet] - https://gerrit.wikimedia.org/r/402786
[09:19:35] (PS1) Giuseppe Lavagetto: logstash: create compound role [puppet] - https://gerrit.wikimedia.org/r/402787
[09:19:37] (PS1) Giuseppe Lavagetto: site.pp: fix more cases of multiple roles being declared [puppet] - https://gerrit.wikimedia.org/r/402788
[09:19:39] (PS1) Giuseppe Lavagetto: site.pp: rationalize prometheus, puppetmaster frontends [puppet] - https://gerrit.wikimedia.org/r/402789
[09:20:34] (PS4) Hashar: test: puppet-syntax now fails on deprecation notices [puppet] - https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915)
[09:20:54] ACKNOWLEDGEMENT - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Jcrespo https://phabricator.wikimedia.org/T184401
[09:23:58] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#3882104 (Volans)
[09:24:38] !log cache_misc: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267
[09:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:48] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[09:26:06] (CR) Markusguenther: "Hi Markus here," [puppet] - https://gerrit.wikimedia.org/r/402665 (owner: Paladox)
[09:26:33] (CR) Markusguenther: "And great that you like it and adapt it :)" [puppet] - https://gerrit.wikimedia.org/r/402665 (owner: Paladox)
[09:28:07] (CR) Hashar: "Running it with a rakefile having solely:" [puppet] - https://gerrit.wikimedia.org/r/333012 (https://phabricator.wikimedia.org/T154915) (owner: Hashar)
[09:28:51] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3882111 (hashar) Pending https://gerrit.wikimedia.org/r/#/c/333012/ to have puppet-syntax to f...
[09:30:34] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9612/" [puppet] - https://gerrit.wikimedia.org/r/402388 (owner: Giuseppe Lavagetto)
[09:30:42] (PS3) Giuseppe Lavagetto: graphite: reorganize roles, one role() call per node [puppet] - https://gerrit.wikimedia.org/r/402388
[09:36:23] (PS1) Elukey: profile::analytics::database::meta: simplify labs deployment [puppet] - https://gerrit.wikimedia.org/r/402791 (https://phabricator.wikimedia.org/T166248)
[09:38:14] (PS2) Volans: wmf-auto-reimage: improve resume capabilities [puppet] - https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702)
[09:38:49] !log upgrading contint1001 / contint1002 | T184267
[09:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:05] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[09:39:28] !log set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to mw1261,mw2251,mw1276 and all videoscalers (Recently rebooted/reimaged)
[09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:02] (PS3) MarcoAurelio: Allow eswiki bureaucrats to add/remove 'accountcreator' [mediawiki-config] - https://gerrit.wikimedia.org/r/395775 (https://phabricator.wikimedia.org/T182201)
[09:43:22] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9613/" [puppet] - https://gerrit.wikimedia.org/r/402791 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[09:45:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1067, db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792
[09:46:05] (CR) Marostegui: [C: -2] "wait for db1067 to catch up" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792 (owner: Marostegui)
[09:46:08] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1067, db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792
[09:46:22] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#3882186 (fgiunchedi) ``` ** 8 printk messages dropped ** [5424311.775321] sd 0:1:0:0: rejecting I/O to offline device [5424311.832004] sd 0:1:0:0: rejecting I/O to offline device ** 8 printk m...
[09:46:31] !log reboot ms-be2037 - T184390
[09:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:42] T184390: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390
[09:46:47] !log rebooting CI
[09:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:23] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#3882189 (fgiunchedi) ``` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52) 14 Logical Drive(s) - Operation Failedit, this may take a few moments.... - 1719-Slot 3 Drive Array - A...
[09:49:54] RECOVERY - Disk space on ms-be2037 is OK: DISK OK
[09:49:54] RECOVERY - very high load average likely xfs on ms-be2037 is OK: OK - load average: 18.95, 4.22, 1.38
[09:49:54] RECOVERY - MD RAID on ms-be2037 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:49:54] RECOVERY - Host ms-be2037 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms
[09:53:33] !log Flashing Smart Array P840 in Slot 3 [ 4.52 -> 6.06 ] on ms-be2037 - T184390 T141756
[09:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:48] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756
[09:53:48] T184390: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390
[09:53:53] (PS3) Giuseppe Lavagetto: role::installserver: create meta-role for installserver [puppet] - https://gerrit.wikimedia.org/r/402389
[09:54:01] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9614/install1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/402389 (owner: Giuseppe Lavagetto)
[09:56:50] (PS3) Giuseppe Lavagetto: site.pp: one role() call for iron.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/402390
[09:58:01] (CR) Giuseppe Lavagetto: [C: 2] site.pp: one role() call for iron.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/402390 (owner: Giuseppe Lavagetto)
[09:59:42] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#3882238 (fgiunchedi) Open>Resolved a:fgiunchedi Looks like controller locked up and mdadm kicked one disk out of the array? Upon reboot the ssd show up and healthy (according to th...
[10:00:54] (PS2) Giuseppe Lavagetto: kafka: create compound roles, one role() call per node definition [puppet] - https://gerrit.wikimedia.org/r/402784
[10:03:07] !log fixing wrong events on db2039, db1071,db2023, db2045, db2052, db1100
[10:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:06] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:04:41] !log drain + reboot analytics1029,1031->1034 for kernel updates
[10:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:16] (PS1) Hashar: contint: reenable overlay/overlayfs kernel modules [puppet] - https://gerrit.wikimedia.org/r/402797 (https://phabricator.wikimedia.org/T184410)
[10:15:56] (CR) Jcrespo: "Should we consider the warmup now?" [mediawiki-config] - https://gerrit.wikimedia.org/r/393588 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[10:17:12] (PS2) Hashar: contint: reenable overlay/overlayfs kernel modules [puppet] - https://gerrit.wikimedia.org/r/402797 (https://phabricator.wikimedia.org/T184410)
[10:17:33] <_joe_> hashar: I think simply contint1001 was never rebooted since blacklisting overlayfs
[10:17:51] <_joe_> what you wrote in the commit message is wrong :)
[10:20:03] (CR) Muehlenhoff: [C: 1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/402797 (https://phabricator.wikimedia.org/T184410) (owner: Hashar)
[10:20:06] (CR) Muehlenhoff: [C: 2] contint: reenable overlay/overlayfs kernel modules [puppet] - https://gerrit.wikimedia.org/r/402797 (https://phabricator.wikimedia.org/T184410) (owner: Hashar)
[10:20:21] _joe_: yeah I am chatting with Moritz about it in private. I was missing a bit and rephrased the commit message
[10:25:02] (CR) Marostegui: "> Should we consider the warmup now?" [mediawiki-config] - https://gerrit.wikimedia.org/r/393588 (https://phabricator.wikimedia.org/T177208) (owner: Marostegui)
[10:26:06] (PS3) Giuseppe Lavagetto: kafka: create compound roles, one role() call per node definition [puppet] - https://gerrit.wikimedia.org/r/402784
[10:26:08] !log Started docker on contint1001 / contint2001 . They were missing the overlay/overlayfs kernel modules | T184410
[10:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:48] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1067, db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792 (owner: Marostegui)
[10:30:29] RECOVERY - MegaRAID on db1011 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:31:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1067, db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792 (owner: Marostegui)
[10:31:22] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1067, db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/402792 (owner: Marostegui)
[10:32:29] (CR) Muehlenhoff: [C: 1] "That's fine. In production experimental is only used on role::cache::misc and role::cache::canary (for the Varnish 5 migration) and those " [puppet] - https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: Paladox)
[10:32:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 and db1089 - T162807 (duration: 00m 50s)
[10:32:33] (CR) Giuseppe Lavagetto: [C: 2] kafka: create compound roles, one role() call per node definition [puppet] - https://gerrit.wikimedia.org/r/402784 (owner: Giuseppe Lavagetto)
[10:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:40] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[10:32:49] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2069384
[10:36:05] (PS4) ArielGlenn: apt: Do not use experimental on stretch [puppet] - https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: Paladox)
[10:37:20] (CR) ArielGlenn: [C: 2] apt: Do not use experimental on stretch [puppet] - https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: Paladox)
[10:39:00] (CR) Faidon Liambotis: [C: -1] "Much better than previous iterations (kudos!) but overall it still feels like a bash script written in Python: it's not leveraging Python " (5 comments) [puppet] - https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: Arturo Borrero Gonzalez)
[10:40:46] !log akosiaris@tin Started deploy [servermon/servermon@53b81d8]: Update servermon
[10:40:48] !log akosiaris@tin Finished deploy [servermon/servermon@53b81d8]: Update servermon (duration: 00m 02s)
[10:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:35] (PS1) Marostegui: db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208)
[10:42:42] (PS2) Giuseppe Lavagetto: kripton: one role() call [puppet] - https://gerrit.wikimedia.org/r/402785
[10:42:59] jynus: ^
[10:43:24] let me amend some changes first
[10:44:22] (PS2) Marostegui: db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208)
[10:44:45] (CR) Giuseppe Lavagetto: [C: 2] kripton: one role() call [puppet] - https://gerrit.wikimedia.org/r/402785 (owner: Giuseppe Lavagetto)
[10:46:59] (PS2) Giuseppe Lavagetto: site.pp: reorganize labs host to use one role() call per node [puppet] - https://gerrit.wikimedia.org/r/402786
[10:50:21] !log roll restart swift in codfw for kernel upgrades
[10:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:57] Operations, netops, Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#3882323 (faidon) No, not resolved yet, but in progress :) You're absolutely right we haven't updated this task though (my fault!) Current progress is: - Netbox has bee...
[10:52:12] Operations, Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#3882325 (faidon)
[10:52:15] Operations, netops, Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#3882324 (faidon)
[10:56:36] (PS1) ArielGlenn: make role::beta::mediawiki into a profile [puppet] - https://gerrit.wikimedia.org/r/402803
[10:57:04] Operations, Internet-Archive, Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544#3882328 (faidon) Correct, that is the intention. I can confirm that's the case that it's not a rumour, but of course take it with a grain o...
[10:57:23] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/402805 (https://phabricator.wikimedia.org/T128546)
[10:59:21] Operations, ops-esams, netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637#3882332 (faidon) >>! In T174637#3871317, @mark wrote: > Have we acquired a new image for AS14907 yet? We have for some time now. It can be found on install1002 -- I've deleted the AS43821 image to avo...
[11:00:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:37] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402805 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:09] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402805 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:22] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402805 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:03:14] (03CR) 10Jcrespo: [C: 031] "This looks ok to me, assuming no query or connection errors, but we should keep the revert handy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [11:03:50] (03CR) 10Marostegui: "> This looks ok to me, assuming no query or connection errors, but we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [11:04:43] 10Operations, 10Mail: tls expiry check for mx vs acme-setup renewal period - https://phabricator.wikimedia.org/T181519#3882337 (10fgiunchedi) Noticed this again today ``` mx1001 Exim SMTP WARNING 2018-01-08 11:02:58 11d 18h 35m 27s 3/3 WARNING - Certificate 'mx1001.wikimedia.org' expires in 49 day(s) (Mon 26... 
[11:05:52] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 00m 51s) [11:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:05] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:06:44] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 00m 51s) [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:09] 10Operations, 10Patch-For-Review: Update people.wikimedia.org with the 2017 Wikimania hackathon group photo - https://phabricator.wikimedia.org/T184338#3880371 (10faidon) That's not the Wikimania 2017 Hackathon (which was in Montreal), but the 2017 Hackathon in Vienna. Both the task title and the commit title... [11:12:26] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3882348 (10hashar) [11:12:29] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3882346 (10hashar) 05Open>03... 
[11:12:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3882349 (10mobrovac) [11:12:44] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2114381 [11:14:07] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3882351 (10MoritzMuehlenhoff) C... [11:14:33] (03PS6) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) [11:14:35] (03CR) 10Hashar: "Attached to T153468" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [11:15:12] 10Operations, 10DNS, 10Traffic, 10Beta-Cluster-reproducible, and 2 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468#2881386 (10hashar) I had the issue a while back T176314#3640963 and went with a workaround of `s/AAAA/A/... [11:15:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3656431 (10fgiunchedi) FWIW two days ago three hosts that were decom as part of this task showed up in icinga (ack'd the alerts now): ``` wtp1018.mgmt  DOWN 2018-01-08... 
[11:18:40] (03PS7) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [11:19:06] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3877128 (10ArielGlenn) Welp, it... [11:19:16] (03CR) 10Paladox: "@Markusguenther thanks :)." [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [11:20:41] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-kafka-jumbo-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3882395 (10Paladox) @Krenair the change was merged now, should we close as resolved? :) [11:23:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9619/" [puppet] - 10https://gerrit.wikimedia.org/r/402786 (owner: 10Giuseppe Lavagetto) [11:24:59] (03PS2) 10Giuseppe Lavagetto: logstash: create compound role [puppet] - 10https://gerrit.wikimedia.org/r/402787 [11:28:33] !log puppet node deactivate wtp10[568] - T177374 [11:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] T177374: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374 [11:33:07] (03PS3) 10Giuseppe Lavagetto: logstash: create compound role [puppet] - 10https://gerrit.wikimedia.org/r/402787 [11:38:14] !log rebooting mwdebug* for kernel security update [11:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:43] (03PS1) 10Alexandros Kosiaris: servermon: Amend the /static alias [puppet] - 10https://gerrit.wikimedia.org/r/402809 [11:39:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] servermon: Amend the /static alias [puppet] - 10https://gerrit.wikimedia.org/r/402809 (owner: 10Alexandros Kosiaris) [11:40:31] (03CR) 10Aklapper: [C: 
04-1] "The commit (photo) has nothing to do with what the commit message states" [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki) [11:43:26] (03PS3) 10Filippo Giunchedi: graphite: cleanup stale ORES metrics [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) [11:44:38] 10Operations, 10ORES, 10Graphite, 10Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3882455 (10fgiunchedi) @Halfak see https://gerrit.wikimedia.org/r/401917 [11:47:42] (03PS39) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [11:48:48] (03CR) 10Paladox: "@Markusguenther some users that i have shown this too say it looks beautiful or looks good." [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [11:56:46] (03CR) 10Markusguenther: "It is not responsive. Or better it was not intended because the TYPO3 gerrit UI is not responsive at all." [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [11:57:04] (03PS8) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [11:57:23] (03CR) 10Paladox: "> It is not responsive. 
Or better it was not intended because the" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [12:01:09] !log rebooting mw1221-mw1235 for kernel security update (along with update to HHVM 3.18.6 where applicable) [12:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:48] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 5 [12:06:01] (03PS4) 10Addshore: Log wikibase dispatchChanges script for testwikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/395968 [12:06:21] (03PS3) 10Marostegui: db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) [12:08:12] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3882490 (10dr0ptp4kt) @Shilad I just wanted to note that I'm back from the long period of family leave (everything's good, BTW) and saw your comment. We're not 10... 
[12:18:19] !log akosiaris@tin Started deploy [servermon/servermon@b9832c5]: Update servermon [12:18:21] !log akosiaris@tin Finished deploy [servermon/servermon@b9832c5]: Update servermon (duration: 00m 02s) [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:24] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9621/" [puppet] - 10https://gerrit.wikimedia.org/r/402787 (owner: 10Giuseppe Lavagetto) [12:23:37] (03PS4) 10Giuseppe Lavagetto: logstash: create compound role [puppet] - 10https://gerrit.wikimedia.org/r/402787 [12:24:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:26:28] (03Merged) 10jenkins-bot: db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:26:32] (03CR) 10Alexandros Kosiaris: [C: 032] pcc: Python3 compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402119 (owner: 10BryanDavis) [12:26:34] (03PS2) 10Alexandros Kosiaris: pcc: Python3 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/402119 (owner: 10BryanDavis) [12:26:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] pcc: Python3 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/402119 (owner: 10BryanDavis) [12:26:39] (03CR) 10jenkins-bot: db-eqiad.php: Warm up future s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402801 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:26:50] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] logstash: create compound role [puppet] - 10https://gerrit.wikimedia.org/r/402787 (owner: 10Giuseppe Lavagetto) [12:26:56] (03PS5) 10Giuseppe Lavagetto: logstash: create compound role 
[puppet] - 10https://gerrit.wikimedia.org/r/402787 [12:26:59] <_joe_> grr [12:27:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Warm up future s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402813 [12:27:32] (03PS2) 10Jcrespo: mariadb: Remove comments about partitioning on db2039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402599 (https://phabricator.wikimedia.org/T184090) [12:27:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Warm up s8 future hosts - T177208 (duration: 00m 52s) [12:27:56] jynus: ^ [12:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [12:28:41] looking [12:29:00] db1098:3318 is complaining on logs [12:30:07] cannot see it, where? [12:30:27] 2018-01-08T12:29:49 [12:30:27] ja.wikipedia.org [12:30:28] jawiki [12:30:32] Server db1098:3318 is not replicating? [12:30:39] jawiki is s6 [12:30:44] not s5/s8 [12:30:46] it is failing on more wikis [12:30:48] reverting [12:31:01] (03CR) 10Marostegui: [V: 032 C: 032] Revert "db-eqiad.php: Warm up future s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402813 (owner: 10Marostegui) [12:31:01] ok with that [12:31:43] "Error connecting to db1098:3318: Can't connect to MySQL server on 'db1098' " [12:31:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Warm up future s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402813 (owner: 10Marostegui) [12:32:37] I see the error, I pooled db1098 instead of db1099 [12:32:41] ok [12:32:49] I will send a new patch [12:34:24] (03PS2) 10Giuseppe Lavagetto: site.pp: fix more cases of multiple roles being declared [puppet] - 10https://gerrit.wikimedia.org/r/402788 [12:34:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: revert warm up s8 future hosts - T177208 (duration: 02m 58s) [12:35:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:03] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [12:35:23] !log rebooting mw1209-mw1220 for kernel security update (along with update to HHVM 3.18.6 where applicable) [12:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:30] !log fdans@tin Started deploy [analytics/aqs/deploy@ab85797]: (no justification provided) [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:46] !log fdans@tin Finished deploy [analytics/aqs/deploy@ab85797]: (no justification provided) (duration: 00m 16s) [12:37:48] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Don’t check constraints on example properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [12:37:50] (03PS1) 10Marostegui: db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402817 (https://phabricator.wikimedia.org/T177208) [12:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:12] jynus: that is the new patch [12:38:29] it is the same but with db1099 instead of db1098 [12:40:27] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402817 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:41:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402817 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:42:57] Hi ops-team - I'm about to deploy AQS for a new endpoint (not visible externally yet). Please let me know of any concerns [12:43:23] (03Merged) 10jenkins-bot: db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402817 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui)
[12:43:29] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9623/" [puppet] - 10https://gerrit.wikimedia.org/r/402788 (owner: 10Giuseppe Lavagetto) [12:43:36] (03CR) 10jenkins-bot: db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402817 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [12:44:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Warm up s8 future hosts - T177208 (duration: 00m 59s) [12:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:50] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [12:46:59] jynus: I think the 3 layer replication is too much for mediawiki [12:47:05] and it is timing out on the hosts [12:47:26] https://logstash.wikimedia.org/goto/464401a786bc8eba5e5fec60af8a82e8 [12:47:48] (03PS2) 10Giuseppe Lavagetto: site.pp: rationalize prometheus, puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/402789 [12:47:50] I am going to revert, we could warm up the hosts by doing some full table scans [12:47:57] wait [12:48:09] it is back to normal now? 
[12:48:12] one sec [12:48:17] don't think so [12:48:29] I have the auto refresh, and it is generating new logs [12:48:34] revert [12:48:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Warm up s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402820 [12:48:56] (03CR) 10Marostegui: [V: 032 C: 032] Revert "db-eqiad.php: Warm up s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402820 (owner: 10Marostegui) [12:49:01] but the problem is not the topology [12:49:09] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Warm up s8 hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402820 (owner: 10Marostegui) [12:49:12] !log joal@tin Started deploy [analytics/aqs/deploy@ab85797]: Add pageview top-by-country endpoint [12:49:13] it is not using gtid [12:49:18] that is a problem [12:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:29] Using_Gtid: Slave_Pos [12:49:33] (that is db1082) [12:49:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Warm up s8 future hosts - T177208 (duration: 00m 27s) [12:49:48] no, I mean mediawiki [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:51] ah [12:49:51] T177208: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208 [12:49:57] "Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on db1100 pos db1071-bin.006093/1039147444" [12:50:08] it doesn't say timed out waiting for gtidXXXXXXXXX [12:50:43] true [12:50:54] and those are not coming from dewiki/wikidata [12:51:06] they are coming from other databases: commons, idwiki [12:52:24] And they are actually still happening [12:52:25] also, why are things waiting on db1071 [12:52:34] that is s8 master [12:52:39] which is not pooled [12:53:25] And they are still waiting [12:54:47] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3882607 (10elukey) Now the 
BMC/IPMI doesn't seem to be happy: ``` elukey@dbstore1002:~$ sudo ipmi-chassis --get-chassis-status ipmi_cmd_get_chassis_status: BMC busy elukey@dbstore10... [12:59:18] Very strange, I don't get it. Maybe having the same host on two different shards is something mediawiki dislikes? [12:59:37] are there still errors? [12:59:43] no no [12:59:52] I was trying to understand what happened [12:59:55] !log reboot kafka1012 for kernel updates [12:59:57] Reviewing things [13:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:15] maybe the job runners got confused for longer [13:00:43] as they may not be refreshing things for longer [13:00:53] we should deploy again with lower load [13:01:19] or on mwdebug [13:01:47] we can try just one single host [13:01:54] we fixed the topology problems a long time ago [13:02:11] what I cannot say is why it even mentions the binary log [13:02:28] as that should never happen; it should use gtid, or fail [13:02:56] and why is it checking db1071? [13:03:06] we should check the loadbalancer code [13:03:12] I guess it checks its immediate master, which is db1071 [13:03:15] (03PS1) 10Marostegui: db-eqiad.php: Try to warm up db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402822 [13:03:35] and if it fails to get a gtid, it falls back to its immediate master? [13:03:44] but, assuming that is true [13:03:54] why didn't it work, as the master is ok [13:04:14] Don't know, maybe because the master has another master?
[13:04:23] check that changeset [13:04:28] we can try to depool that one [13:04:31] *pool [13:04:42] I would do that [13:04:47] depool it from s8 [13:05:06] but I smell a bug [13:05:14] let's try that, depooling it from s8 indeed [13:05:16] of course, it is a strange situation [13:05:30] (03PS1) 10Ladsgroup: Enable fine grained usage tracking in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402823 (https://phabricator.wikimedia.org/T172914) [13:05:36] (03PS2) 10Marostegui: db-eqiad.php: Try to warm up db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402822 [13:05:50] that must work [13:05:52] !log rebooting mw1259/mw1260 (video scalers) for kernel security update (along with update to HHVM 3.18.6 where applicable) [13:05:58] it could be that [13:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:20] as dbs are not really on s8, because it technically does not exist yet [13:06:27] it falls back to the binary log [13:06:46] but I wonder why it doesn't detect it is part of s5 [13:07:09] !log joal@tin Finished deploy [analytics/aqs/deploy@ab85797]: Add pageview top-by-country endpoint (duration: 17m 57s) [13:07:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Try to warm up db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402822 (owner: 10Marostegui) [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:21] are the pt-heartbeat rights ok?
[13:07:36] rights == grants [13:10:03] Those hosts were in s5 first, serving traffic [13:10:05] (03Merged) 10jenkins-bot: db-eqiad.php: Try to warm up db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402822 (owner: 10Marostegui) [13:10:07] (03CR) 10jenkins-bot: db-eqiad.php: Try to warm up db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402822 (owner: 10Marostegui) [13:10:13] So we should've seen errors before when they were serving if heartbeat was wrong [13:11:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Warm up db1109 (duration: 00m 52s) [13:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] The next test should be pooling db1109 also on s8 [13:15:34] actually, we do not need that [13:15:36] db1109 is receiving queries fine and not erroring [13:15:55] remember the final aim is to warm up [13:16:06] we can pool all and depool all [13:16:18] and then revert before switchover [13:16:23] I still see a few errors regarding db1082 [13:16:29] of course, there is a potential bug [13:16:31] (03CR) 10Mobrovac: [C: 031] "GTG now." [puppet] - 10https://gerrit.wikimedia.org/r/401784 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [13:16:41] but I would not get into debugging today [13:17:08] db1082 is s5 and was not touched [13:17:25] yeah, but why is it timing out on db1071? [13:17:52] timeouts are normal [13:18:00] as long as they are very few [13:18:02] on db1071, s8 master? [13:18:10] on 71, no [13:18:11] Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on db1082 pos db1071-bin.006094/65955016 [13:18:19] ok [13:18:23] there are very very very few of them [13:18:29] so db1082 is the one that is timing out [13:18:41] maybe mediawiki got confused, some slaves replicating from one master and others replicating from another one within the same shard?
[13:18:55] no, we fixed that problem a long time ago [13:19:11] as in, we knew that was an error and it was explicitly fixed by using gtid [13:19:25] and remember in the past [13:19:36] we changed the topology in advance of the failover [13:19:45] always working [13:19:49] yeah, that is true [13:21:20] most errors seem to be coming from rpc.php [13:21:39] which could be some strange cache/mediawiki stalled code [13:21:55] threads not being restarted and reloading configuration [13:22:23] or maybe there is a bug when things are pooled on 2 shards [13:22:38] I remember discussing that multi-source didn't work well on mediawiki [13:22:47] could be yes [13:23:19] leave the new one pooled for a while [13:23:26] yeah, I am going to do that [13:23:27] if the error rate is going down [13:23:30] no change [13:23:35] and we can monitor the status [13:25:01] connection swaps could also be a factor in higher connection error rates [13:25:30] many connections changing at the same time, making timeouts fail [13:25:32] or [13:25:54] reused connections not working as well as they could when changing replica sets [13:29:47] don't know, it is weird, I now see more than usual SELECT master_gtid_wait for s5 hosts on tendril [13:30:12] maybe we would need to restart the job queue after switchover, which means we may need tim or giuseppe around [13:30:58] are any of those not connected to scap? [13:31:08] that also could be it [13:32:03] connected to scap? [13:32:12] outdated code [13:32:19] let me see [13:32:25] in this case, outdated configuration code [13:32:50] could be also that gtid fails in this particular instance [13:32:58] code looks good [13:33:16] what about runtime of processes? [13:33:18] I think it is not a coincidence, it was not happening before all the changes [13:33:34] has it been running longer than the change?
[13:33:58] no, they are new, seconds old [13:34:08] I am going to remove db1109 from s5 and leave things as they were [13:34:09] s8 replicas may be missing some gtid events [13:34:10] to see what happens [13:34:33] wait "180359179-180359179-96523837,171974884-171974884-1473084269,0-180359179-5734605861,171978778-171978778-3980319,171970704-171970704-351094624,171978768-171978768-202416,171978777-171978777-326342503" [13:34:36] is not normal [13:34:49] that is what I am saying :) [13:34:55] it should wait on just 1 gtid [13:35:22] (03PS1) 10Marostegui: db-eqiad.php: Restore db1109 normal position [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402828 [13:35:28] what is the gtid config of db1071? [13:35:37] Slave_pos I think [13:35:47] I would switch it to no [13:35:54] Slave_pos indeed [13:37:06] but for instance, db1106 is now showing some select master gtid wait, and db1106 has nothing to do with db1071, as its master is db1070 [13:37:14] let me remove db1109 so we can start from 0 [13:37:21] and see if those connections disappear [13:37:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1109 normal position [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402828 (owner: 10Marostegui) [13:38:54] (03CR) 10Rush: "tag along to faidon's -1" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:39:07] ACKNOWLEDGEMENT - MegaRAID on db1059 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Jcrespo https://phabricator.wikimedia.org/T184160 [13:39:14] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1109 normal position [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402828 (owner: 10Marostegui) [13:39:24] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1109 normal position [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402828 (owner: 10Marostegui) [13:40:03] chasemp: didn't know about CI/py3, do you
know if there's a task for that? [13:40:06] we should definitely fix that [13:40:19] volans: ^^ [13:40:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1109 original status (duration: 00m 50s) [13:40:37] paravoid: arturo talked briefly with hashar about it my impression was yes but not sure where [13:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:06] !log reboot analytics10[36-39] for kernel updates [13:41:14] hashar/volans: context is "CI linting scripts with #!/usr/bin/env python3 in operations/puppet" [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:32] I don't remember [13:41:34] paravoid: uh? [13:41:51] paravoid: thanks for the review there man, I didn't know python-apt existed :) [13:41:55] volans: https://gerrit.wikimedia.org/r/#/c/398079/9/modules/apt/files/apt-upgrade.py@1 [13:42:17] (jfyi, because I thought you'd be interested or may even know something about it :) [13:42:37] BTW thanks for the reviews paravoid :-) [13:42:42] !log rolling restart of wdqs servers for kernel upgrades [13:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:23] paravoid: ahhh now I got what you mean, yes I remember some issue with it... let me try to find some paper trail [13:44:57] IIRC is that we run tox in our puppet repo and that catches all of them with the same env, but I need to check [13:45:04] After removing db1109 errors on db1082 waiting for db1071 have stopped - very weird [13:45:11] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3882714 (10Cmjohnson) @elukey yes, the server will need to be powered down for a minute to unlock the Idrac. Can we do this right after meeting today or do you want to schedule for to... 
[13:46:24] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3882716 (10elukey) @Cmjohnson Would it be fine tomorrow around this time? Or whenever you prefer, I'd need to send an email and announce the downtime, better to alert people :) [13:46:49] volans: short version I think is it's based on the .py extension of the file and CI had no ability to account for py3 context in that case [13:46:59] paravoid, chasemp see T152950#2881201 as an example [13:47:00] T152950: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950 [13:47:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3882717 (10Cmjohnson) This was me last week, these servers have not gone through the decom steps yet and still have puppet running. [13:47:07] of previous discussions about it [13:47:42] regarding flake8 failing to properly lint a valid py3 script [13:47:45] :( [13:48:30] volans chasemp paravoid: the tox environment that runs flake8 is using python2. So if it tries to parse a python3 file it bails out [13:48:32] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3882721 (10Cmjohnson) @elukey. Let’s schedule for 1500UTC tomorrow. [13:49:08] I am not sure whether a task got filed. Off the top of my head, the idea was to define two different environments, one for py2 and another for py3 [13:49:28] we should probably revive T144169 and make it general for all python-related changes, to have CI send to a script the list of modified files and have the script detect the python version and run the linting appropriately [13:49:29] T144169: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 [13:49:51] hashar: couldn't you run both at the same time?
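The idea raised above, of handing a script the list of modified files and letting it detect the Python version, could be sketched roughly as follows. This is only an illustration, not the CI setup actually in use: the function names are hypothetical, the version is guessed from the shebang alone, and files with no versioned shebang fall back to an assumed default of Python 2.

```python
import re

# Matches interpreters such as "python2", "python3" or "python3.5" in a
# shebang line. Greedy ".*" means the last "python" on the line wins.
_SHEBANG_RE = re.compile(r"^#!.*\bpython(\d)?(?:\.\d+)?\b")


def python_major_from_shebang(first_line, default=2):
    """Return the major version digit from a shebang, else `default`.

    `default` is an assumption for files with no shebang or an
    unversioned "python" interpreter.
    """
    match = _SHEBANG_RE.match(first_line.strip())
    if match is None or match.group(1) is None:
        return default
    return int(match.group(1))


def split_by_python_version(first_lines, default=2):
    """Partition {path: first_line} into (py2_paths, py3_paths).

    Anything not detected as Python 3 is placed in the py2 list.
    """
    py2, py3 = [], []
    for path, first_line in sorted(first_lines.items()):
        major = python_major_from_shebang(first_line, default)
        (py3 if major == 3 else py2).append(path)
    return py2, py3
```

Each resulting list could then be passed to a flake8 run under the matching interpreter; how the first lines are read from the changed files is left out of the sketch.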
[13:49:53] and pass file filters to have flake8 running under python3 to only process .py3 files (and thus files using python3 would need to use the .py3 extension) [13:50:01] no [13:50:12] no .py3 please :) [13:50:22] that's entirely non-standard and shouldn't be required [13:50:37] !log rebooting mw image scalers in eqiad for kernel security update (along with update to HHVM 3.18.6 where applicable) [13:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] else we gotta grab the list of python files changed in the patchset, check their shebangs to establish a list of python2 vs python3 files [13:51:28] what about files without a shebang? [13:51:28] and then inject those files to a flake8 running py2 and another flake8 running py3 [13:52:46] unfortunately I gotta run and I'll be back many hours later (lotsameetings), can one of you file a task in the meantime so that we can discuss there? [13:52:51] or revive one, or whatever :) [13:52:58] sure [13:53:49] arturo/chasemp: (that obviously shouldn't block your work, we can keep that as py2.7 in the meantime :) [13:54:00] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3882726 (10elukey) downtime announced to engineering@ and analytics@ [13:55:19] great paravoid thanks [13:57:18] arturo: also, if it's not using python3-only syntax I think flake8 should not complain even if it's py3, but worth checking first [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T1400). [14:00:05] Lucas_WMDE, Jayprakash12345, Urbanecm, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] I'm here :) [14:00:24] I’m here! :) [14:00:39] o/ [14:01:24] Who will be the swatter? [14:02:19] o/ [14:03:19] o/ [14:03:32] (03CR) 10Hashar: [C: 032] "Ica42f7d125d803b6d4a49711794d5626e48e5aef is in 1.31.0-wmf.15 which is the deployed version on all groups :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:03:43] hashar: are you doing the swat today? [14:03:50] (03CR) 10jerkins-bot: [V: 04-1] Don’t check constraints on example properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:05:03] bah [14:05:15] Lucas_WMDE: your patch seems to need a rebase https://gerrit.wikimedia.org/r/#/c/399825/ [14:05:27] oh ok [14:06:31] doing Urbanecm patches [14:06:43] hashar, ack [14:07:15] (03PS2) 10Lucas Werkmeister (WMDE): Don’t check constraints on example properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) [14:07:31] (03CR) 10Hashar: [C: 032] Enable wgKartographerStaticMapframe for lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401590 (https://phabricator.wikimedia.org/T183981) (owner: 10Urbanecm) [14:07:47] (03CR) 10Lucas Werkmeister (WMDE): "Rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:08:33] bah [14:08:45] hashar, what's happening? [14:08:50] I am rusty :] [14:09:34] (03CR) 10Hashar: [C: 032] "> I’m confused – did you mean If068c786779122d4f5ff158c9ac8a9a6e6610535? 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:09:44] gotta rebase stuff a bit [14:09:57] hashar, may I help somehow? [14:10:38] (03PS3) 10Hashar: Enable wgKartographerStaticMapframe for lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401590 (https://phabricator.wikimedia.org/T183981) (owner: 10Urbanecm) [14:10:40] (03PS3) 10Hashar: Move wiktionary HD logo to wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401519 (https://phabricator.wikimedia.org/T183922) (owner: 10Urbanecm) [14:10:42] (03PS4) 10Hashar: Update logo for chrwiki, add the HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401593 (https://phabricator.wikimedia.org/T180553) (owner: 10Urbanecm) [14:10:47] Urbanecm: I just cherry pick them locally [14:10:54] Ok [14:11:24] (03Merged) 10jenkins-bot: Don’t check constraints on example properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:11:37] (03CR) 10jenkins-bot: Don’t check constraints on example properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399825 (https://phabricator.wikimedia.org/T183267) (owner: 10Lucas Werkmeister (WMDE)) [14:12:11] Lucas_WMDE: ok sorry for the delay. Your change is on mwdebug1001 [14:12:27] hashar: thanks! 
I’ll test it [14:12:39] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401590 (https://phabricator.wikimedia.org/T183981) (owner: 10Urbanecm) [14:12:49] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401519 (https://phabricator.wikimedia.org/T183922) (owner: 10Urbanecm) [14:12:59] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401593 (https://phabricator.wikimedia.org/T180553) (owner: 10Urbanecm) [14:13:15] Urbanecm: and I am going to push all your three changes to mwdebug1001 as soon as they get merged [14:13:26] Ok [14:13:58] (03Merged) 10jenkins-bot: Enable wgKartographerStaticMapframe for lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401590 (https://phabricator.wikimedia.org/T183981) (owner: 10Urbanecm) [14:14:20] (03CR) 10Awight: "Looks real nice--is the EQCSS library necessary, though?" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [14:14:26] Lucas_WMDE: scap is still syncing on mwdebug1001 :( [14:14:31] ah no [14:14:33] (03CR) 10jenkins-bot: Enable wgKartographerStaticMapframe for lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401590 (https://phabricator.wikimedia.org/T183981) (owner: 10Urbanecm) [14:14:34] it is complete [14:14:39] hashar: I’m not sure what scap is, sorry :) [14:14:42] but it seems to work [14:15:02] scap is the deployment tool we use to push the code to mwdebug1001 and then to the rest of the infra [14:15:17] ah ok [14:15:38] hashar, so what changes are at mwdebug? 
[14:16:00] only the wikidata one so far [14:16:01] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#3882767 (10Gehel) [14:16:02] hashar: Hello [14:16:05] hashar, ok [14:16:17] Jayprakash12345: hello :) I am proceeding some other changes before yours :) [14:16:23] !log hashar@tin Synchronized wmf-config/Wikibase-production.php: Don’t check constraints on example properties - T183267 (duration: 00m 51s) [14:16:23] hashar: I did a bit more testing, looks like the constraints change works just like intended \o/ [14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:33] T183267: Don’t check constraints on “Wikidata property example” statements - https://phabricator.wikimedia.org/T183267 [14:16:43] Lucas_WMDE: it is in production now \oo/ [14:16:49] thank you! [14:16:58] hashar: Ok :) [14:17:40] (03Merged) 10jenkins-bot: Move wiktionary HD logo to wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401519 (https://phabricator.wikimedia.org/T183922) (owner: 10Urbanecm) [14:17:50] (03Merged) 10jenkins-bot: Update logo for chrwiki, add the HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401593 (https://phabricator.wikimedia.org/T180553) (owner: 10Urbanecm) [14:18:17] (03CR) 10jenkins-bot: Move wiktionary HD logo to wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401519 (https://phabricator.wikimedia.org/T183922) (owner: 10Urbanecm) [14:18:36] gehel: Are you here? 
[14:18:45] (03PS5) 10Hashar: Enable commons import in tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 (https://phabricator.wikimedia.org/T181774) (owner: 10Jayprakash12345) [14:18:47] (03PS4) 10Hashar: Add new namespace aliases on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711) (owner: 10Jayprakash12345) [14:18:49] Jayprakash12345: yep, I'm here! [14:18:49] (03PS5) 10Hashar: Turn on mapframe for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: 10Jayprakash12345) [14:18:51] (03PS6) 10Hashar: Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [14:18:52] Urbanecm: all your changes are now on mwdebug1001 [14:18:59] hashar, will test them [14:19:19] (03CR) 10Hashar: [C: 032] "SWAT!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 (https://phabricator.wikimedia.org/T181774) (owner: 10Jayprakash12345) [14:19:24] 10Operations, 10Continuous-Integration-Config: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#3882789 (10Volans) [14:19:33] 10Operations, 10Continuous-Integration-Config: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#3882801 (10Volans) p:05Triage>03Normal [14:20:07] hashar, can you deploy them? [14:20:13] Urbanecm: sure :) [14:20:18] thx [14:21:13] Jayprakash12345: I'm around if you need me, but I'll expect everything should just go smoothly. I'll keep an eye open on my usual graphs... [14:21:41] hashar: In https://gerrit.wikimedia.org/r/#/c/399556/ , We cant test it on mwdebug. 
[14:21:49] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable wgKartographerStaticMapframe for lvwiki - T183981 (duration: 00m 51s) [14:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:01] T183981: Enable wgKartographerStaticMapframe for lvwiki - https://phabricator.wikimedia.org/T183981 [14:22:11] hashar: So Synchronized. [14:22:31] (03Merged) 10jenkins-bot: Enable commons import in tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 (https://phabricator.wikimedia.org/T181774) (owner: 10Jayprakash12345) [14:22:44] Jayprakash12345: https://gerrit.wikimedia.org/r/#/c/399556/ yes I will just deploy it [14:23:14] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Move wiktionary HD logo to wiktionaries - T183922 (duration: 00m 50s) [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] T183922: There's one wiktionary entry in wgLogoHD in IS.php - https://phabricator.wikimedia.org/T183922 [14:25:47] (03CR) 10Hashar: [C: 032] Add new namespace aliases on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711) (owner: 10Jayprakash12345) [14:25:53] !log hashar@tin Synchronized static/images/project-logos: Update logo for chrwiki, add the HD version T180553 (duration: 00m 51s) [14:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:04] T180553: Cherokee Wikipedias uses an outdated logo - https://phabricator.wikimedia.org/T180553 [14:26:32] !log cache_text: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [14:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:43] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:27:03] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Update logo for chrwiki, add 
the HD version T180553 (duration: 00m 50s) [14:27:05] Urbanecm: so I think I have deployed all your changes now [14:27:12] hashar, that's good! [14:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:16] Should I check them? [14:27:46] Urbanecm: probably :) [14:27:58] (03Merged) 10jenkins-bot: Add new namespace aliases on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711) (owner: 10Jayprakash12345) [14:28:02] Ok, let's do it :) [14:28:10] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#3882829 (10Gehel) Extract from the logs: ``` Jan 08 14:20:08 wdqs2001 systemd[1]: Started P... [14:28:25] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable commons import in tawikisource - T181774 (duration: 00m 48s) [14:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:38] T181774: Enabling options to import from commons to Tamil wikisource - https://phabricator.wikimedia.org/T181774 [14:29:44] Jayprakash12345: I am not processing the rest of your changes [14:30:01] hashar: Why? 
[14:30:11] ??????????????????????????????????????????????, [14:30:25] I mean I am going to deploy the changes you listed on https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T1400 :D [14:32:47] Jayprakash12345: there are some bad links in the database, but that is just for a few articles so I guess we can fix them later ( https://phabricator.wikimedia.org/T183711#3882838 ) [14:33:06] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add new namespace aliases on zhwiki - T183711 (duration: 00m 50s) [14:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:17] T183711: Request new namespace aliases for "User:", "File:" and their talk page spaces on zh wikipedia - https://phabricator.wikimedia.org/T183711 [14:33:42] (03CR) 10Hashar: [C: 032] Turn on mapframe for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: 10Jayprakash12345) [14:34:24] gehel: Please Around. [14:34:40] Jayprakash12345: o/ [14:35:03] (03Merged) 10jenkins-bot: Turn on mapframe for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: 10Jayprakash12345) [14:35:43] (03PS9) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [14:36:01] Jayprakash12345: gehel is that for mapframe being enabled to the arabic wiki? [14:36:07] do you want to test it out on mwdebug1001 firsT? [14:36:23] I have pulled it there [14:36:32] gehel: You can test? [14:36:36] hashar: I'll let Jayprakash12345 do the testing. 
I'm just standing by in case something goes wrong (unlikely) [14:36:50] gehel: ok [14:37:00] (03PS40) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [14:37:05] 10Operations, 10MediaWiki-Configuration, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3882858 (10Joe) [14:37:19] * gehel notes that he is happy to see maps (frame or not frame) on more wikis! [14:37:53] gehel: we should make it the default :] [14:38:10] Jayprakash12345: gehel and it is now enabled on mwdebug1001 [14:39:02] (03PS10) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [14:39:27] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 28 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[ethtool eth2 -K lro off],Exec[txqueuelen-eth2],Exec[ethtool eth3 -K lro off],Exec[txqueuelen-eth3] [14:40:21] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, strategy for deployment:" [puppet] - 10https://gerrit.wikimedia.org/r/401784 (https://phabricator.wikimedia.org/T184110) (owner: 10Mobrovac) [14:40:55] (03CR) 10Hashar: [C: 032] Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [14:42:21] (03Merged) 10jenkins-bot: Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [14:43:22] gehel: Like it is not working https://ar.wikipedia.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Jayprakash12345/a [14:43:41] Jayprakash12345: null-edit just to be sure? [14:44:00] debt: ^ since you did most of our testing of mapframe so far... 
[14:44:08] (03PS11) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [14:44:17] gehel: sooory [14:45:12] (03CR) 10Paladox: Update gerrit login display (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [14:46:13] Jayprakash12345: i fixed it [14:46:31] well or you did :] [14:46:39] yep, looks good now... [14:46:47] Jayprakash12345: https://ar.wikipedia.org/w/index.php?title=%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Jayprakash12345/a&diff=26615454&oldid=26615431 [14:46:57] surely HTML with left to right language can lead to some oddities :( [14:47:18] (03PS3) 10Giuseppe Lavagetto: site.pp: rationalize prometheus, puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/402789 [14:47:36] hashar: Thanks [14:47:40] syncing it works for me [14:47:40] that's a pretty good test! I'm not sure we already have any other RTL language [14:47:51] Jayprakash12345: thanks a lot for pushing this forward! [14:47:59] make sure to announce it everywhere to have more maps added on all of arwiki ! [14:48:19] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Turn on mapframe for Arabic Wikipedia - T183764 (duration: 00m 51s) [14:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:31] T183764: Turn on mapframe for Arabic Wikipedia - https://phabricator.wikimedia.org/T183764 [14:50:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#3882767 (10MoritzMuehlenhoff) That's a bug in the systemd unit of prometheus-blazegraph-expo... [14:51:19] Amir1: still around? [14:51:30] hashar: Is this Add Translation namespace on Punjabi Wikisource on mwdebug? 
[14:51:32] hashar: o/ [14:51:34] yup [14:51:36] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add Translation: namespace on Punjabi Wikisource - T179807 (duration: 00m 50s) [14:51:36] (03PS8) 10Albert221: Remove language button from Wikidata and MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) [14:51:43] Jayprakash12345: I have deployed it to the whole cluster [14:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:49] T179807: Add Translation namespace on Punjabi Wikisource - https://phabricator.wikimedia.org/T179807 [14:51:55] hashar: Ok. [14:51:57] (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: rationalize prometheus, puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/402789 (owner: 10Giuseppe Lavagetto) [14:51:59] Amir1: going to do https://gerrit.wikimedia.org/r/402823 [14:52:03] Jayprakash12345: Thank you :] [14:52:14] hashar: nothing testable :D [14:52:24] (03PS2) 10Hashar: Enable fine grained usage tracking in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402823 (https://phabricator.wikimedia.org/T172914) (owner: 10Ladsgroup) [14:53:20] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402823 (https://phabricator.wikimedia.org/T172914) (owner: 10Ladsgroup) [14:54:39] (03Merged) 10jenkins-bot: Enable fine grained usage tracking in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402823 (https://phabricator.wikimedia.org/T172914) (owner: 10Ladsgroup) [14:54:41] (03CR) 10jenkins-bot: Update logo for chrwiki, add the HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401593 (https://phabricator.wikimedia.org/T180553) (owner: 10Urbanecm) [14:55:24] Amir1: syncing it [14:56:10] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable fine grained usage tracking in hewiki - T172914 (duration: 00m 50s) [14:56:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:22] T172914: [Tracking] Fine-grained change notifications based on tracking from Lua getters via __index - https://phabricator.wikimedia.org/T172914 [14:57:07] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:17] !log rolling reboot of maps servers for kernel upgrade [14:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:32] <_joe_> bast3002 might be me [14:58:00] <_joe_> yeah, fixing [14:58:52] (03CR) 10Hashar: [C: 032] Add test2wiki as a group1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402445 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [14:59:02] Amir1: and doing the last one ! :) [14:59:24] amazing, thanks [14:59:47] hey gehel and Jayprakash12345 thanks for doing the deploy -- sorry I couldn't be more available. I was expecting to be on a plane right now, but we're now delayed by 2+ . hours due to fog :-/ [15:00:20] debt: all that technology, and you're grounded because of fog... [15:00:30] (03Merged) 10jenkins-bot: Add test2wiki as a group1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402445 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [15:00:52] gehel: and it's fog in SF, not here in Denver! [15:00:59] (03PS1) 10Giuseppe Lavagetto: site.pp: fixup for Icd70ef861dcadeeae7df0415a5c2779679c5e144 [puppet] - 10https://gerrit.wikimedia.org/r/402836 [15:01:36] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:37] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:02:43] <_joe_> that's still me, fixing [15:03:11] (03PS2) 10Giuseppe Lavagetto: site.pp: fixup for Icd70ef861dcadeeae7df0415a5c2779679c5e144 [puppet] - 10https://gerrit.wikimedia.org/r/402836 [15:03:28] !log hashar@tin Synchronized dblists/group1-wikipedia.dblist: Add test2wiki as a group1 wiki - T182326 (duration: 00m 50s) [15:03:31] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks simpler than before so that's nice." [puppet] - 10https://gerrit.wikimedia.org/r/402345 (owner: 10Giuseppe Lavagetto) [15:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] T182326: Make one group1 wiki a client of testwikidata (preferably a test wiki) - https://phabricator.wikimedia.org/T182326 [15:04:30] (03PS3) 10Giuseppe Lavagetto: site.pp: fixup for Icd70ef861dcadeeae7df0415a5c2779679c5e144 [puppet] - 10https://gerrit.wikimedia.org/r/402836 [15:04:43] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] site.pp: fixup for Icd70ef861dcadeeae7df0415a5c2779679c5e144 [puppet] - 10https://gerrit.wikimedia.org/r/402836 (owner: 10Giuseppe Lavagetto) [15:08:46] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:49] Amir1: done :] [15:10:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] hiera: port nuyaml to hiera 3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402346 (owner: 10Giuseppe Lavagetto) [15:10:31] great thanks! [15:11:36] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:12:28] hashar: SWAT is complete, can I continue to reboot servers? 
[15:12:54] !log cache_upload: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [15:13:02] (03CR) 10Jcrespo: [C: 032] mariadb: Remove comments about partitioning on db2039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402599 (https://phabricator.wikimedia.org/T184090) (owner: 10Jcrespo) [15:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:04] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [15:13:18] moritzm: yes! [15:14:01] (03CR) 10Giuseppe Lavagetto: hiera: port nuyaml to hiera 3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402346 (owner: 10Giuseppe Lavagetto) [15:16:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:16:07] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:16:16] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:16:26] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:16:28] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:16:32] (03Merged) 10jenkins-bot: mariadb: Remove comments about partitioning on db2039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402599 (https://phabricator.wikimedia.org/T184090) (owner: 10Jcrespo) [15:16:46] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4027_v4, cp4027_v6 [15:17:37] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:17:46] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 114 ESP OK [15:17:47] !log jynus@tin Synchronized wmf-config/db-codfw.php: Fix db2039 comments (duration: 00m 50s) [15:17:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:07] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [15:18:07] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [15:18:16] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [15:18:26] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [15:18:36] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 114 ESP OK [15:21:17] (03CR) 10Alexandros Kosiaris: hiera: port nuyaml to hiera 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402346 (owner: 10Giuseppe Lavagetto) [15:23:34] !log reboot kafka1013 for kernel updates [15:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:46] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:27:07] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:27:17] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [15:27:26] PROBLEM - Check systemd state on kafka1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:27:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [15:28:37] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:06] RECOVERY - Host kafka1013 is UP: PING WARNING - Packet loss = 93%, RTA = 0.24 ms [15:30:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [15:30:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [15:30:27] RECOVERY - Check systemd state on kafka1013 is OK: OK - running: The system is fully operational [15:30:48] I added downtime for kafka1013 [15:31:01] (03PS1) 10Giuseppe Lavagetto: site.pp: one role() call in ruthenium, tungsten [puppet] - 10https://gerrit.wikimedia.org/r/402840 [15:31:03] (03PS1) 10Giuseppe Lavagetto: site.pp: one role called with role() for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/402841 [15:31:34] (03CR) 10jerkins-bot: [V: 04-1] site.pp: one role() call in ruthenium, tungsten [puppet] - 10https://gerrit.wikimedia.org/r/402840 (owner: 10Giuseppe Lavagetto) [15:31:38] (03CR) 10jerkins-bot: [V: 04-1] site.pp: one role called with role() for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/402841 (owner: 10Giuseppe Lavagetto) [15:32:37] (03PS1) 10Ottomata: Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) [15:33:05] (03CR) 10jerkins-bot: [V: 04-1] Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) 
(owner: 10Ottomata) [15:33:09] (03CR) 10Ottomata: [C: 031] site.pp: one role called with role() for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/402841 (owner: 10Giuseppe Lavagetto) [15:33:29] <_joe_> ottomata: thanks [15:35:21] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "Overriding jenkins for the styleguide as it's a temporary situation" [puppet] - 10https://gerrit.wikimedia.org/r/402840 (owner: 10Giuseppe Lavagetto) [15:37:39] (03PS2) 10Ottomata: Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) [15:38:51] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "Again, overriding jenkins for our own good." [puppet] - 10https://gerrit.wikimedia.org/r/402841 (owner: 10Giuseppe Lavagetto) [15:39:04] 10Operations, 10Dumps-Generation: Reboot snapshot*, dumpsdata*, dataset1001, ms1001, francium - https://phabricator.wikimedia.org/T184443#3883024 (10ArielGlenn) p:05Triage>03Normal [15:40:06] (03PS3) 10Ottomata: Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) [15:41:05] (03CR) 10Elukey: Move role refinery::job::* to profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [15:42:04] (03PS4) 10Ottomata: Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) [15:42:10] (03CR) 10Ottomata: "Cool, done. ::guard isn't used anywhere for now anyway..." [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [15:42:36] (03CR) 10Ottomata: "Looks like a no-op to me! 
:) https://puppet-compiler.wmflabs.org/compiler03/9627/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [15:43:00] (03CR) 10Ottomata: [C: 032] Move role refinery::job::* to profiles [puppet] - 10https://gerrit.wikimedia.org/r/402843 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [15:43:22] _joe_: puppet merging yours [15:43:37] <_joe_> ottomata: oh thanks sorry, got nerd-sniped [15:43:43] :) np [15:44:15] <_joe_> I was feeling too good about the whole thing in fact [15:44:38] (03PS2) 10Giuseppe Lavagetto: wmflib: simplify the role() function, convert to the new API [puppet] - 10https://gerrit.wikimedia.org/r/402345 [15:44:50] 10Operations, 10Puppet: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3883043 (10fgiunchedi) [15:45:28] <_joe_> ottomata: running puppet on stat1005 I saw [15:45:34] <_joe_> Notice: /Stage[main]/Statistics::User/User[stats]/groups: groups changed '' to ['wikidev'] [15:45:44] <_joe_> and Notice: /Stage[main]/Packages::Libgsl0_dev/Package[libgsl0-dev]/ensure: created [15:46:04] <_joe_> which seem to happen at every puppet run [15:46:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3883057 (10akosiaris) Yeah, this has uncovered an unfortunate issue in our decomissioning/reimaging process. 
See T184444 for more info [15:47:55] (03PS1) 10Ottomata: Parametersize kafka_cluster_name in refinery job camus [puppet] - 10https://gerrit.wikimedia.org/r/402847 (https://phabricator.wikimedia.org/T166248) [15:48:35] 10Operations, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3883062 (10Imarlier) [15:48:41] (03PS2) 10Ottomata: Parametersize kafka_cluster_name in refinery job camus [puppet] - 10https://gerrit.wikimedia.org/r/402847 (https://phabricator.wikimedia.org/T166248) [15:48:44] (03PS1) 10Giuseppe Lavagetto: bastionhost::twofa: re-add the new install access keystone [puppet] - 10https://gerrit.wikimedia.org/r/402848 [15:48:46] <_joe_> andrewbogott: ^^ [15:49:37] (03CR) 10Andrew Bogott: [C: 031] "Looks, right -- thanks." [puppet] - 10https://gerrit.wikimedia.org/r/402848 (owner: 10Giuseppe Lavagetto) [15:50:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] hiera: first step of simplification (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402347 (owner: 10Giuseppe Lavagetto) [15:51:16] (03CR) 10Ottomata: [V: 032 C: 032] "Looks fine: https://puppet-compiler.wmflabs.org/compiler03/9628/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/402847 (https://phabricator.wikimedia.org/T166248) (owner: 10Ottomata) [15:52:08] _joe_: huh uhh, puppet-merge didn't merge when i said [15:52:08] yes [15:52:09] OH [15:52:10] right [15:52:11] multiple :p [15:52:13] uhhh ok merging [15:52:16] lol [15:52:27] actually _joe_ i'm merging something that will break puppet on stat1005 [15:52:28] will fix real quikc [15:52:33] i'll make sure puppet runs there [15:53:57] (03PS1) 10Ottomata: Move refinery::job::data_check from stat1005 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/402853 (https://phabricator.wikimedia.org/T167790) [15:54:54] (03CR) 10Ottomata: [C: 032] Move refinery::job::data_check from stat1005 to analytics1003 [puppet] - 
10https://gerrit.wikimedia.org/r/402853 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [15:58:54] (03PS1) 10Ottomata: Render role's analytics refinery logrotate from profile [puppet] - 10https://gerrit.wikimedia.org/r/402857 (https://phabricator.wikimedia.org/T167790) [15:58:58] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/refinery] [16:00:05] (03CR) 10Ottomata: [C: 032] Render role's analytics refinery logrotate from profile [puppet] - 10https://gerrit.wikimedia.org/r/402857 (https://phabricator.wikimedia.org/T167790) (owner: 10Ottomata) [16:03:58] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:04:36] (03PS1) 10Ema: lvs: rename lvs1007 eth interfaces [puppet] - 10https://gerrit.wikimedia.org/r/402859 (https://phabricator.wikimedia.org/T167299) [16:11:32] (03PS2) 10Giuseppe Lavagetto: bastionhost::twofa: re-add the new install access keystone [puppet] - 10https://gerrit.wikimedia.org/r/402848 [16:12:18] PROBLEM - Nginx local proxy to apache on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:08] RECOVERY - Nginx local proxy to apache on mw2144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.200 second response time [16:16:23] (03CR) 10Giuseppe Lavagetto: [C: 032] bastionhost::twofa: re-add the new install access keystone [puppet] - 10https://gerrit.wikimedia.org/r/402848 (owner: 10Giuseppe Lavagetto) [16:16:25] (03PS12) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [16:18:23] 10Operations, 10Traffic, 10Performance-Team (Radar): Upgrade cache_text to Varnish 5 - https://phabricator.wikimedia.org/T184448#3883208 (10ema) p:05Triage>03Normal [16:28:29] !log About to run refreshFileHeaders.php on all wikis to fix https://phabricator.wikimedia.org/T178849 [16:28:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:16] 10Operations, 10monitoring: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3883279 (10fgiunchedi) a:03fgiunchedi [16:36:07] !log stopping replication on db2040 [16:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:39] (03PS1) 10Alexandros Kosiaris: DNM: Dummy testing change [puppet] - 10https://gerrit.wikimedia.org/r/402861 [16:40:02] (03CR) 10jerkins-bot: [V: 04-1] DNM: Dummy testing change [puppet] - 10https://gerrit.wikimedia.org/r/402861 (owner: 10Alexandros Kosiaris) [16:40:59] 10Operations, 10monitoring: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3699692 (10akosiaris) T180784 has some interesting discussion as well. [16:41:36] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3883319 (10fdans) [16:45:30] volans: Did we already discuss doing something like https://phabricator.wikimedia.org/T184456 and decide against it? [16:45:56] !log milimetric@tin Started deploy [analytics/refinery@f99e7dd]: Update and re-run interlanguage job [16:45:58] (03PS2) 10Alexandros Kosiaris: DNM: Dummy testing change [puppet] - 10https://gerrit.wikimedia.org/r/402861 [16:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:22] (03CR) 10jerkins-bot: [V: 04-1] DNM: Dummy testing change [puppet] - 10https://gerrit.wikimedia.org/r/402861 (owner: 10Alexandros Kosiaris) [16:50:35] andrewbogott: the A:all alias is there just for that no? [16:50:40] all: O{*} and not O{project:contintcloud} and not O{project:admin-monitoring} [16:50:47] volans: oh! [16:50:55] yes, it is, I've just been using '*' instead because I forgot [16:51:04] :) [16:51:15] * andrewbogott still training fingers [16:51:16] thx [16:51:37] yw! 
[16:52:35] (03PS2) 10Andrew Bogott: nova scheduler pool: Add some comments so I remember which hosts are for infra [puppet] - 10https://gerrit.wikimedia.org/r/402356 [16:53:38] (03CR) 10Andrew Bogott: [C: 032] nova scheduler pool: Add some comments so I remember which hosts are for infra [puppet] - 10https://gerrit.wikimedia.org/r/402356 (owner: 10Andrew Bogott) [16:56:44] (03PS2) 10Andrew Bogott: bootstrapvz: remove ldap setup from firstboot script [puppet] - 10https://gerrit.wikimedia.org/r/401633 (https://phabricator.wikimedia.org/T181375) [16:57:11] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: remove ldap setup from firstboot script [puppet] - 10https://gerrit.wikimedia.org/r/401633 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [16:57:24] !log milimetric@tin Finished deploy [analytics/refinery@f99e7dd]: Update and re-run interlanguage job (duration: 11m 28s) [16:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:48] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused [17:05:48] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3883499 (10Gehel) Playing with jmx_exporter and elasticsearch, it looks like the metrics exposed through the elasticsearch API are... 
[17:08:47] PROBLEM - Host maps1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:37] RECOVERY - Host maps1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:16:47] (03PS1) 10Imarlier: modules/webperf: handle oversamples differently than regular samples [puppet] - 10https://gerrit.wikimedia.org/r/402867 (https://phabricator.wikimedia.org/T181413) [17:18:26] andrewbogott: and just to mention it, feel free to use the timeout options (there are 2) in cumin [17:28:32] 10Operations, 10Developer-Relations: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3883546 (10Qgil) p:05Triage>03Lowest [17:28:41] 10Operations, 10Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3883561 (10Qgil) [17:28:46] 10Operations, 10Developer-Relations: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3883546 (10Qgil) 05Open>03stalled [17:42:38] PROBLEM - HP RAID on db2060 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [17:42:40] ACKNOWLEDGEMENT - HP RAID on db2060 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184464 [17:42:44] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3883603 (10ops-monitoring-bot) [17:43:48] marostegui: ^^^ all yours ;) [17:43:48] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3883607 (10Marostegui) p:05Triage>03High a:03Papaul I am raising this to High Priority because the warranty expires 14th Jan 2018 [17:44:21] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - 
https://phabricator.wikimedia.org/T184464#3883611 (10Marostegui) [17:54:03] 10Operations, 10Continuous-Integration-Infrastructure (shipyard): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3883638 (10hashar) a:05Joe>03None Resetting assignee, came from the parent task. Potentially we could rebuild the Jessie package `node-tunnel-agent` with patch h... [17:54:12] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): npm 1.4.21 can't use a http proxy - https://phabricator.wikimedia.org/T183569#3883640 (10hashar) [18:00:05] gehel: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T1800). Please do the needful. [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Allow "releasers-mediawiki" sudo rights to manage Jenkins - https://phabricator.wikimedia.org/T183972#3883662 (10RobH) Please note this was approved in the ops meeting (typo to fix in patchset). I'm... [18:00:14] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production SSH, statistics-privatedata-users, analytics-privatedata-users, perf-team for imarlier - https://phabricator.wikimedia.org/T184190#3883663 (10RobH) Please note this was approved in the ops meeting (typo to fix in patchse... 
[18:02:20] 10Operations, 10Access-Policy, 10Phabricator: please add Casey Dentinger to Phabricator Security Project - https://phabricator.wikimedia.org/T184465#3883673 (10Jgreen) [18:02:31] jouncebot: o/ [18:02:31] (03PS1) 10Jcrespo: mariadb: Depool db2040 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402875 (https://phabricator.wikimedia.org/T176243) [18:05:15] (03PS9) 10Albert221: Remove language button from Wikidata and MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) [18:07:22] (03PS2) 10Jcrespo: mariadb: Depool db2040 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402875 (https://phabricator.wikimedia.org/T176243) [18:08:01] (03Abandoned) 10Imarlier: webperf.py: Handle oversamples differently than regular samples [puppet] - 10https://gerrit.wikimedia.org/r/394375 (https://phabricator.wikimedia.org/T181413) (owner: 10Imarlier) [18:08:05] !log gehel@tin Started deploy [wdqs/wdqs@c680f55]: (no justification provided) [18:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2040 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402875 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [18:10:08] !log gehel@tin Finished deploy [wdqs/wdqs@c680f55]: (no justification provided) (duration: 02m 03s) [18:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:37] SMalyshev: deployment completed, tests are green... 
[18:12:14] (03Merged) 10jenkins-bot: mariadb: Depool db2040 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402875 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [18:15:10] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2040 (duration: 00m 50s) [18:15:15] 10Operations, 10DNS, 10Mail, 10Traffic: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3876973 (10CCogdill_WMF) Hi all--just confirming we use the wikipedia.org domain for fundraising emails, but never wikipedia.com. +1 to strengthening DMARC and SPF rules. [18:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:41] gehel: thank you! [18:17:10] 10Operations: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3883750 (10Aklapper) #Operations I guess [18:19:58] (03CR) 10Zppix: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [18:20:24] Albert221: ^ [18:23:45] no_justification hi, wondering if you could review https://gerrit.wikimedia.org/r/#/c/402665/ please? 
:) [18:24:56] I don't have the time to, "pretty login page" is very very far down my priority list [18:34:05] (03CR) 10jenkins-bot: mariadb: Remove comments about partitioning on db2039 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402599 (https://phabricator.wikimedia.org/T184090) (owner: 10Jcrespo) [18:34:48] (03CR) 10jenkins-bot: Add test2wiki as a group1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402445 (https://phabricator.wikimedia.org/T182326) (owner: 10Ladsgroup) [18:35:31] (03CR) 10jenkins-bot: mariadb: Depool db2040 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402875 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [18:36:06] (03CR) 10jenkins-bot: Enable commons import in tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 (https://phabricator.wikimedia.org/T181774) (owner: 10Jayprakash12345) [18:36:41] (03CR) 10jenkins-bot: Add new namespace aliases on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400267 (https://phabricator.wikimedia.org/T183711) (owner: 10Jayprakash12345) [18:37:06] (03CR) 10jenkins-bot: Turn on mapframe for Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: 10Jayprakash12345) [18:37:36] (03CR) 10jenkins-bot: Add Translation: namespace on Punjabi Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/389433 (https://phabricator.wikimedia.org/T179807) (owner: 10Jayprakash12345) [18:38:15] (03CR) 10jenkins-bot: Enable fine grained usage tracking in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402823 (https://phabricator.wikimedia.org/T172914) (owner: 10Ladsgroup) [18:54:01] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3883867 (10Gehel) >>! 
In T182991#3851834, @Lucas_Werkmeister_WMDE wrote: > Can you perhaps briefly explain how the specs compare to the existing WDQS cl... [18:58:35] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3883874 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T1900). [19:00:06] Albert221: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:56] if I understand everything, I will be given the backend name and I have to select it in X-Wikimedia-Debug browser extension to "load" wiki from this backend, right? [19:01:08] (03CR) 10Markusguenther: "@Paladox The library is used because the TYPO3 Server team was not willing to adjust gerrit itself to be able to update without hazel. And" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (owner: 10Paladox) [19:03:38] Albert221: Yes that's right [19:03:47] And also change the "OFF" button to "ON" [19:03:56] Usually the backend is mwdebug1002 [19:06:09] I can SWAT if your response doesn't mean you're already doing it RoanKattouw [19:06:53] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3883892 (10Lucas_Werkmeister_WMDE) Okay, all of that sounds reasonable enough :) thank you! 
[19:07:04] I'll take that as a resounding "go for it" :) [19:07:13] (03PS10) 10Thcipriani: Remove language button from Wikidata and MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [19:07:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [19:09:10] I'm busy so go for it [19:10:13] cool, doing so :) [19:13:07] PROBLEM - HHVM rendering on mw2150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:57] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 78828 bytes in 0.286 second response time [19:14:16] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Krenair) lucky you didn't go with nvidia: https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/ [19:14:19] (03CR) 10Dbrant: [C: 031] Add DELETE to list of allowed methods for text varnish [puppet] - 10https://gerrit.wikimedia.org/r/402433 (https://phabricator.wikimedia.org/T182825) (owner: 10Gergő Tisza) [19:14:28] (03Merged) 10jenkins-bot: Remove language button from Wikidata and MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [19:15:32] (03CR) 10Jdlrobson: Remove language button from Wikidata and MediaWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [19:15:41] Albert221: your change ^ is live on mwdebug1002, check please. If all looks as expected let me know and I will deploy everywhere. 
[19:15:53] sure, doing that now [19:16:42] (03CR) 10jenkins-bot: Remove language button from Wikidata and MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402643 (https://phabricator.wikimedia.org/T183665) (owner: 10Albert221) [19:18:00] for the record: https://i.imgur.com/wlG0vv7.png and https://i.imgur.com/LcljGsQ.png everything works [19:18:04] Albert221: tested myself and looks good! [19:18:15] (checked enwiki as well to check it still shows there) [19:19:20] cool, going live everywhere :) [19:19:33] nice! :) [19:22:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:402643|Remove language button from Wikidata and MediaWiki]] T183665 (duration: 00m 51s) [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:37] ^ Albert221 should be live everywhere now [19:22:38] T183665: language icon appears in Wikidata on mobile web, but doesn't do anything - https://phabricator.wikimedia.org/T183665 [19:22:44] thanks for the patch! [19:22:44] so it is done now? :O :) [19:22:47] yay! [19:22:57] it was very interesting experience! 
:D [19:23:43] yep, that patch is all done, live in production :) [19:24:05] (03CR) 10Ottomata: [C: 032] Fixes to better configure hadoop.proxyuser [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402424 (owner: 10Ottomata) [19:24:15] (03PS2) 10Ottomata: Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 [19:24:45] (03CR) 10jerkins-bot: [V: 04-1] Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 (owner: 10Ottomata) [19:27:01] (03PS3) 10Ottomata: Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 [19:29:18] (03CR) 10Ottomata: [C: 032] Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 (owner: 10Ottomata) [19:31:14] (03PS1) 10Ottomata: Fix typo in parameter name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402895 [19:31:22] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo in parameter name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402895 (owner: 10Ottomata) [19:31:43] (03PS1) 10Ottomata: Update cdh to fix typo in parameter name [puppet] - 10https://gerrit.wikimedia.org/r/402896 [19:31:50] (03CR) 10Ottomata: [V: 032 C: 032] Update cdh to fix typo in parameter name [puppet] - 10https://gerrit.wikimedia.org/r/402896 (owner: 10Ottomata) [19:33:58] PROBLEM - puppet last run on furud is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:34:11] !log rebooting analytics1002 and then analytics1001 to apply proxyuser changes and kernel update [19:34:17] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:49] ottomata: furud alert above is you [19:34:56] Error 500 on SERVER: Server Error: Evaluation Error: Error whil [19:34:57] e evaluating a Resource Statement, Class[Cdh::Hadoop]: has no parameter named 'core_site_extra_properties [19:35:48] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:12] yeahhh it'll be fixed by most recent commit [19:36:47] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:36:57] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:38:03] hey jdlrobson, would you approve task on GCI? :) [19:40:06] Albert221: i can link me [19:40:18] https://codein.withgoogle.com/dashboard/task-instances/6169486744879104/ [19:40:24] Albert221: also good job getting your first? Swat done :) [19:40:45] it was really nice experience, really really! :D :D [19:41:22] jdlrobson: i went ahead and approved for you :) [19:41:55] Zppix: thanks! [19:44:28] RECOVERY - Long running screen/tmux on analytics1003 is OK: OK: No SCREEN or tmux processes detected. 
[19:53:00] (03PS1) 10Alex Monk: Copy deployment-mx file to deployment-mx02 [puppet] - 10https://gerrit.wikimedia.org/r/402899 (https://phabricator.wikimedia.org/T184244) [19:54:25] (03CR) 10Andrew Bogott: [C: 032] Copy deployment-mx file to deployment-mx02 [puppet] - 10https://gerrit.wikimedia.org/r/402899 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [19:54:44] (03PS3) 10Andrew Bogott: Followup Ia5d07908: Fix sentry's base::service_unit to require correct class [puppet] - 10https://gerrit.wikimedia.org/r/372495 (https://phabricator.wikimedia.org/T173554) (owner: 10Alex Monk) [19:55:24] (03CR) 10Andrew Bogott: [C: 032] Followup Ia5d07908: Fix sentry's base::service_unit to require correct class [puppet] - 10https://gerrit.wikimedia.org/r/372495 (https://phabricator.wikimedia.org/T173554) (owner: 10Alex Monk) [19:59:41] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3884113 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... [20:00:43] 10Operations, 10Ops-Access-Requests: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3884118 (10cy534) [20:01:58] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:02:03] (03PS1) 10Andrew Bogott: labs.yaml: set prometheus_nodes to an empty list [puppet] - 10https://gerrit.wikimedia.org/r/402901 [20:02:41] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing repo on deployment-tin? - https://phabricator.wikimedia.org/T184238#3877100 (10mmodell) I replaced the DEPLOY_HEAD file by running `scap deploy` and then I ran into a different error, which I'm fixing, then this... 
[20:02:51] (03CR) 10Andrew Bogott: [C: 032] labs.yaml: set prometheus_nodes to an empty list [puppet] - 10https://gerrit.wikimedia.org/r/402901 (owner: 10Andrew Bogott) [20:03:58] RECOVERY - puppet last run on furud is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:04:18] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:05:48] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:06:47] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:24:15] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing repo on deployment-tin? - https://phabricator.wikimedia.org/T184238#3884195 (10mmodell) Now I get an error because /var/lib/superset does not exist: ``` Notice: /Stage[main]/Superset/Exec[init_superset]/returns:... [20:24:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3884196 (10Cmjohnson) [20:26:00] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 4 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3884200 (10Gilles) Thumbor isn't involved with transcodes, only thumbnails. Taking care of transcode logic for this should all happen wit... [20:30:07] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 4 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3884214 (10Gilles) As for thumbnails, it's not that it's hard to implement, but you'll create a lot of extra purge traffic and cache inva... 
[20:36:02] (03PS1) 10Gilles: Upgrade to 1.10 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/402906 (https://phabricator.wikimedia.org/T183907) [20:41:05] akosiaris: mind if I re-enable puppet on builder08 and builder05? (Nothing urgent, just trying to keep things puppetized) [20:42:06] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3884249 (10Shilad) @dr0ptp4kt Thanks for chiming in, and I hope the leave was restful! My read on the AMD + tensorflow / keras situation is that support is movi... [20:44:15] (03PS1) 10Milimetric: [WIP] Don't merge [puppet] - 10https://gerrit.wikimedia.org/r/402907 [20:48:58] (03PS1) 10Andrew Bogott: shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) [20:49:23] (03CR) 10jerkins-bot: [V: 04-1] shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) (owner: 10Andrew Bogott) [20:54:13] (03PS2) 10Andrew Bogott: shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) [20:54:17] (03PS1) 10Andrew Bogott: ircbot: remove a duplicate entry [puppet] - 10https://gerrit.wikimedia.org/r/402909 [20:54:17] (03PS1) 10Andrew Bogott: nagios: remove defines for the 'wmt' wmcs project [puppet] - 10https://gerrit.wikimedia.org/r/402910 (https://phabricator.wikimedia.org/T184449) [20:56:23] (03CR) 10Alex Monk: [C: 031] ircbot: remove a duplicate entry [puppet] - 10https://gerrit.wikimedia.org/r/402909 (owner: 10Andrew Bogott) [20:56:53] (03CR) 10Alex Monk: [C: 031] shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) (owner: 10Andrew Bogott) [20:57:18] (03CR) 10Alex Monk: [C: 031] nagios: remove defines for the 
'wmt' wmcs project [puppet] - 10https://gerrit.wikimedia.org/r/402910 (https://phabricator.wikimedia.org/T184449) (owner: 10Andrew Bogott) [20:58:39] (03PS2) 10Andrew Bogott: ircbot: add a separate logfile for #wikimedia-cloud-feed [puppet] - 10https://gerrit.wikimedia.org/r/402909 [20:58:41] (03PS3) 10Andrew Bogott: shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) [20:58:43] (03PS2) 10Andrew Bogott: nagios: remove defines for the 'wmt' wmcs project [puppet] - 10https://gerrit.wikimedia.org/r/402910 (https://phabricator.wikimedia.org/T184449) [20:59:24] (03CR) 10Andrew Bogott: [C: 032] ircbot: add a separate logfile for #wikimedia-cloud-feed [puppet] - 10https://gerrit.wikimedia.org/r/402909 (owner: 10Andrew Bogott) [20:59:31] (03CR) 10Andrew Bogott: [C: 032] shinken: remove references to 'wmt' project [puppet] - 10https://gerrit.wikimedia.org/r/402908 (https://phabricator.wikimedia.org/T184449) (owner: 10Andrew Bogott) [20:59:42] (03CR) 10Andrew Bogott: [C: 032] nagios: remove defines for the 'wmt' wmcs project [puppet] - 10https://gerrit.wikimedia.org/r/402910 (https://phabricator.wikimedia.org/T184449) (owner: 10Andrew Bogott) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:01:44] heh nice [21:01:44] (03CR) 10Dzahn: [C: 032] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Ingush" [dns] - 10https://gerrit.wikimedia.org/r/402656 (https://phabricator.wikimedia.org/T184374) (owner: 10Urbanecm) [21:02:06] "Your horoscope predicts another unfortunate [...] 
deploy" [21:02:10] I like these jouncebot messages [21:02:19] ;) [21:05:24] !log new Wikipedia language: "inh" - recreating/reloading DNS zones to add "inh" (Ingush) from langs.tmpl (T184374) https://wikitech.wikimedia.org/wiki/Add_a_wiki#DNS [21:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:35] T184374: Create Wikipedia Ingush - https://phabricator.wikimedia.org/T184374 [21:06:35] heya Krinkle, just to confirm [21:06:43] yall aren't using the eventlogging zeromq endpoint anymore, right? [21:07:56] hmmm wait, yes you are [21:07:57] coal. [21:11:59] !log arlolra@tin Started deploy [parsoid/deploy@1dac474]: Updating Parsoid to e133312 [21:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:33] (03CR) 10Dzahn: "also see ticket comment https://phabricator.wikimedia.org/T184338#3882341" [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki) [21:15:31] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services: Puppet disabled for a month on deployment-restbase instances - https://phabricator.wikimedia.org/T184477#3884325 (10Krenair) [21:16:06] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3884337 (10Krenair) [21:16:09] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services: Puppet disabled for a month on deployment-restbase instances - https://phabricator.wikimedia.org/T184477#3884336 (10Krenair) [21:16:29] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services: Puppet disabled for a month on deployment-restbase0[12] instances - https://phabricator.wikimedia.org/T184477#3884325 (10Krenair) [21:21:07] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#3884352 (10Krenair) p:05Triage>03Normal [21:21:30] 10Puppet, 
10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#3884365 (10Krenair) It actually looks like no one but me has logged onto this thing [21:22:30] !log arlolra@tin Finished deploy [parsoid/deploy@1dac474]: Updating Parsoid to e133312 (duration: 10m 31s) [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:49] !log Updated Parsoid to e133312 (T182349, T183893, T159985) [21:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:02] T183893: Block elements within inline elements reported as "Multiple unclosed formatting tags" - https://phabricator.wikimedia.org/T183893 [21:31:02] T182349: Section parsing bug on :en:Wikimedia Foundation - https://phabricator.wikimedia.org/T182349 [21:31:02] T159985: Implement language variant support in the REST API - https://phabricator.wikimedia.org/T159985 [21:31:33] RECOVERY - HP RAID on db2055 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [21:32:10] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3884405 (10Krenair) -snapshot01 is T184270 (package it wants is missing from stretch, moritz to fix when higher priority things are done) [21:32:18] (03PS2) 10Dzahn: mediawiki::jobrunner: move firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/399543 [21:37:54] 10Operations, 10monitoring: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996#3884413 (10Dzahn) a:03Dzahn [21:39:37] 10Operations: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884432 (10Dzahn) [21:40:03] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - 473 bytes in 0.001 second response time [21:40:12] PROBLEM - Nginx local proxy to apache on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time [21:40:24] 10Operations: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (10Dzahn) [21:40:49] (03PS3) 10RobH: Nightly server: let MW releasers manage Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/399123 (https://phabricator.wikimedia.org/T183972) (owner: 10Chad) [21:41:04] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#3884448 (10Dzahn) [21:41:06] 10Operations: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3884422 (10Dzahn) [21:41:12] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [21:41:12] RECOVERY - Nginx local proxy to apache on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time [21:41:26] (03PS4) 10RobH: Nightly server: let MW releasers manage Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/399123 (https://phabricator.wikimedia.org/T183972) (owner: 10Chad) [21:41:50] (03PS1) 10Ottomata: Create profile::cache::kafka::certificate class to DRY require of varnishkafka TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) [21:42:16] (03CR) 10RobH: [C: 032] "This was approved in today's ops meeting, so fixed the typo and rebased, now merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/399123 (https://phabricator.wikimedia.org/T183972) (owner: 10Chad) [21:42:21] (03CR) 10jerkins-bot: [V: 04-1] Create profile::cache::kafka::certificate class to DRY require of varnishkafka TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [21:42:37] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3884454 (10Dzahn) 05Open>03stalled [21:43:03] no_justification: ^ just merged the access rights additions for mw releasers [21:43:57] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Allow "releasers-mediawiki" sudo rights to manage Jenkins - https://phabricator.wikimedia.org/T183972#3884457 (10RobH) 05Open>03Resolved a:03RobH merged live [21:44:00] (03PS2) 10Ottomata: Create profile::cache::kafka::certificate to DRY require of varnishkafka TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) [21:44:01] ottomata: Indeed. For coal. [21:44:07] aye [21:44:19] ottomata: But I'm trying to prioritise converting it to Kafka soon. [21:44:24] Given Prometheus is taking longer than expected. [21:44:29] Krinkle: aye cool [21:44:29] (03CR) 10jerkins-bot: [V: 04-1] Create profile::cache::kafka::certificate to DRY require of varnishkafka TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [21:44:39] FYI, we will soon (?) be moving stuff over to the new jumbo cluster [21:44:41] including eventlogging stuff [21:45:10] doing so might be slightly disruptive, as offsets will be different for consumers. i think for your purposes you almost always start from latest in the stream, right?
[21:45:57] (03PS3) 10Ottomata: Create profile::cache::kafka::certificate to DRY require of cert [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) [21:46:23] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3722645 (10mmodell) @fgiunchedi sounds good to me! Puppet is now broken on the old redis nodes,... [21:46:49] (03CR) 1020after4: [C: 031] hieradata: use deployment-redis05 for labs jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/387579 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [21:47:05] (03CR) 1020after4: [C: 031] hieradata: add redis stretch deployment-prep instances [puppet] - 10https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [21:47:15] !log bsitzmann@tin Started deploy [mobileapps/deploy@1bfd4b0]: Update mobileapps to d20915c (T184430 T184429) [21:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:28] T184429: Ensure "timestamp" refers to the requested revision - https://phabricator.wikimedia.org/T184429 [21:47:28] T184430: Flag project main pages with type=mainpage - https://phabricator.wikimedia.org/T184430 [21:47:53] (03CR) 1020after4: [C: 031] labs: use new redis servers for locks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387570 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi) [21:47:55] 10Operations, 10Ops-Access-Requests: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3884474 (10RobH) I've dropped @RStallman-legalteam an email asking about nda expiry check for the actual NDA. Once we have that approval, we should be ok (pending @Slaporte's recomm... 
[21:48:47] (03CR) 10Dzahn: [C: 032] "noop and as done before for all the other appservers too http://puppet-compiler.wmflabs.org/9632/" [puppet] - 10https://gerrit.wikimedia.org/r/399543 (owner: 10Dzahn) [21:48:56] (03PS3) 10Dzahn: mediawiki::jobrunner: move firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/399543 [21:50:11] (03CR) 10Dzahn: "21:49:19 wmf-style: total violations delta -5" [puppet] - 10https://gerrit.wikimedia.org/r/399543 (owner: 10Dzahn) [21:52:48] !log bsitzmann@tin Finished deploy [mobileapps/deploy@1bfd4b0]: Update mobileapps to d20915c (T184430 T184429) (duration: 05m 33s) [21:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:00] T184429: Ensure "timestamp" refers to the requested revision - https://phabricator.wikimedia.org/T184429 [21:53:00] T184430: Flag project main pages with type=mainpage - https://phabricator.wikimedia.org/T184430 [21:53:26] (03PS1) 10Ottomata: Mv varnishkafka profile certificate::ssl_key_password [labs/private] - 10https://gerrit.wikimedia.org/r/403061 [21:54:06] (03CR) 10Ottomata: [V: 032 C: 032] Mv varnishkafka profile certificate::ssl_key_password [labs/private] - 10https://gerrit.wikimedia.org/r/403061 (owner: 10Ottomata) [21:57:06] (03CR) 10Ottomata: "ooook ! https://puppet-compiler.wmflabs.org/compiler02/9634/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [22:00:04] dapatrick, bawolff, and Reedy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180108T2200). [22:00:04] No GERRIT patches in the queue for this window AFAICS. 
[22:00:16] (03PS4) 10Dzahn: prometheus: move duplicate include, use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/397727 [22:00:50] lol jouncebot [22:13:20] (03PS1) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [22:15:18] (03CR) 10Ottomata: "This change will need a commit to the private repo to rename/move the hiera ssl_key_password setting." [puppet] - 10https://gerrit.wikimedia.org/r/403059 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [22:16:25] (03PS2) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [22:18:11] 10Puppet, 10Analytics: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3884526 (10Krenair) [22:21:39] (03PS3) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [22:22:16] (03CR) 10jerkins-bot: [V: 04-1] Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 (owner: 10Ottomata) [22:23:55] (03PS1) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [22:24:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [22:24:53] 10Operations, 10Ops-Access-Requests: Requesting extended access to stat1005 for jdcc - https://phabricator.wikimedia.org/T184085#3884551 (10RStallman-legalteam) Confirming that Justin Clark's NDA is on file. This type of NDA doesn't have an expiration date. Thanks! 
[22:25:43] (03PS4) 10Ottomata: Tweaks to profile::cache::kafka::webrequest::jumbo test [puppet] - 10https://gerrit.wikimedia.org/r/403064 [22:27:12] (03PS2) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [22:28:58] that's a slight improvement over the earlier jouncebot message "Your horoscope predicts another unfortunate [...] deploy" [22:30:37] (03PS13) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [22:33:07] the list of messages jouncebot pulls from: https://phabricator.wikimedia.org/source/jouncebot/browse/master/DefaultConfig.yaml;64702f99bd2016c76862d8c2af3f30baa83cb219$24-39 [22:41:20] bd808, I love that [22:42:11] wth :) - '{deployers}: I, the Bot under the Fountain, allow thee, The Deployer, to do {event.window} deploy. ({event.url}).' [22:42:19] back from getting drilled on :) [22:43:46] oh my [22:43:57] uh... how are you doing then? [22:44:30] Niharika wrote all the new ones. 
she got tired of the 2 old messages it had [22:45:38] apergos: numbed and tired :) it's funny how the dentist is tiring even though you are sitting there doing nothing [22:47:32] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [22:48:12] you're enduring, that's a lot of energy right there [22:48:32] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [22:57:05] (03PS1) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [22:57:30] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [22:57:32] (03PS2) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [22:57:53] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:05:44] (03PS3) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [23:06:06] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:06:23] (03PS4) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [23:06:46] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components 
post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:11:42] (03CR) 10Andrew Bogott: [C: 031] "This seems totally reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:14:02] (03PS5) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [23:14:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:15:10] (03PS6) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [23:15:33] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:16:26] ottomata[m]: I would love it if you could spend a few minutes fixing up or deleting your broken VMs. These things have a tendency to drift and cause us real problems in the long run. https://phabricator.wikimedia.org/T184482 [23:20:58] 10Puppet, 10Analytics: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#3884526 (10Andrew) I enabled puppet on druid-test02 and now it says: ``` Info: Loading facts Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error... 
[23:27:10] (03PS7) 10Rush: WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) [23:27:32] (03CR) 10jerkins-bot: [V: 04-1] WIP: toolforge: ferm hook to restart components post updates [puppet] - 10https://gerrit.wikimedia.org/r/403072 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [23:35:51] (03PS1) 10Jforrester: Save -> Publish on remaining Wikinewses which haven't updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403077 [23:45:23] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [23:45:32] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:46:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [23:49:06] (03CR) 10Dzahn: "cant compile prometheus change but that's unrelated http://puppet-compiler.wmflabs.org/9635/prometheus1003.eqiad.wmnet/change.prometheus10" [puppet] - 10https://gerrit.wikimedia.org/r/397727 (owner: 10Dzahn) [23:50:42] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 74.95 ms [23:56:47] !log rutherfordium (people.wm.org) - upgrading PHP5 [23:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log