[00:07:07] (03PS1) 10Dzahn: etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 [00:17:05] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [00:19:42] (03PS1) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 [00:21:37] (03CR) 10jerkins-bot: [V: 04-1] lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [00:22:06] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3162334 (10KTC) >>! In T160529#3161835, @grin wrote: > Dropping/autorejecting email with matching header > `​X-Spam-Score: .+\+\+\+\+\+` > (which is above spam scrote 5.00) probably helps a lot.... [00:23:05] PROBLEM - Postgres Replication Lag on maps-test2003 is CRITICAL: CRITICAL - Rep Delay is: 18125.133775 Seconds [00:23:37] (03PS1) 10Dzahn: delete netmon::migration class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/346924 [00:24:05] RECOVERY - Postgres Replication Lag on maps-test2003 is OK: OK - Rep Delay is: 0.0 Seconds [00:26:01] (03PS2) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 [00:27:18] (03CR) 10jerkins-bot: [V: 04-1] lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [00:29:53] (03PS3) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 [00:30:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [00:33:05] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:34:05] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [00:36:15] (03PS1) 10Dzahn: mw_rc_irc: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346926 [00:36:49] (03Abandoned) 10Dzahn: ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 (owner: 10Dzahn) [00:39:55] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 19138.964259 Seconds [00:40:55] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 0.0 Seconds [00:41:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 19236.489041 Seconds [00:41:55] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:43:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [00:45:05] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3162366 (10bbogaert) Hi @MoritzMuehlenhoff , ldap1 is running Ubuntu 14.04.5 LTS. Google Cloud Directory Sync [1], to sync users from LDAP with G-Suite, is hosted on... [00:50:05] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:55:43] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3118086 (10Dzahn) >Once T160158 is implemented, we can add a script to remove all users disabled in corp LDAP from internal mailing lists running on fermium The command to do this... 
[00:58:35] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 20256.705279 Seconds [00:59:35] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:09:55] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [01:19:05] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [01:29:45] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 76380.517878 Seconds [01:29:55] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 76386.614576 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 76394.560024 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83766.445524 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83771.28037 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83771.288633 Seconds [01:38:55] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:41:55] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 77106.563438 Seconds [01:42:05] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:42:05] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:45:05] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 84671.340698 Seconds [01:45:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 84671.353127 Seconds [01:52:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 891.28 seconds [01:59:45] RECOVERY - Postgres Replication 
Lag on maps1002 is OK: OK - Rep Delay is: 27.676968 Seconds [01:59:55] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 33.643299 Seconds [02:00:05] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 41.445415 Seconds [02:00:05] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 4.148824 Seconds [02:00:05] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 9.167432 Seconds [02:00:05] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 9.169056 Seconds [02:11:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 194.12 seconds [02:25:40] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 09m 54s) [02:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:26] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3162419 (10tstarling) >>! In T156924#3138751, @tstarling wrote: > ``` > $wgConfigRegistry =... [02:47:25] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:59:05] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:59:39] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 14m 11s) [02:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Apr 7 03:04:52 UTC 2017 (duration 5m 13s) [03:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:42] (03PS1) 10BBlack: [WIP] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [03:15:26] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [03:17:05] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:20:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 28915 seconds ago, expected 28800 [03:25:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 29215 seconds ago, expected 28800 [03:27:05] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [03:30:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 29515 seconds ago, expected 28800 [03:35:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 29815 seconds ago, expected 28800 [03:40:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 30116 seconds ago, expected 28800 [03:45:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 30415 seconds ago, expected 28800 [03:46:05] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:50:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 30715 seconds ago, expected 28800 [03:55:15] PROBLEM - check_puppetrun on mintaka 
is CRITICAL: CRITICAL: Puppet last ran 31015 seconds ago, expected 28800 [04:00:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 31315 seconds ago, expected 28800 [04:05:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 31616 seconds ago, expected 28800 [04:10:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 31916 seconds ago, expected 28800 [04:13:05] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:15:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 32216 seconds ago, expected 28800 [04:20:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 32516 seconds ago, expected 28800 [04:22:55] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=295.60 Read Requests/Sec=713.40 Write Requests/Sec=0.70 KBytes Read/Sec=36445.20 KBytes_Written/Sec=35.20 [04:25:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 32816 seconds ago, expected 28800 [04:30:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 33116 seconds ago, expected 28800 [04:31:55] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.20 Read Requests/Sec=0.60 Write Requests/Sec=0.50 KBytes Read/Sec=16.80 KBytes_Written/Sec=5.20 [04:35:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 33416 seconds ago, expected 28800 [04:40:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 33715 seconds ago, expected 28800 [04:40:25] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:42:05] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [04:45:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 34016 seconds ago, expected 28800 [04:50:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 34316 seconds ago, expected 28800 [04:54:05] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:55:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 34615 seconds ago, expected 28800 [05:00:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 34915 seconds ago, expected 28800 [05:05:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 35215 seconds ago, expected 28800 [05:08:25] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [05:10:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 35515 seconds ago, expected 28800 [05:12:32] <_joe_> I love those FR alerts [05:15:05] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 35815 seconds ago, expected 28800 [05:17:55] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:20:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 36115 seconds ago, expected 28800 [05:22:05] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:25:15] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet last ran 36415 seconds ago, expected 28800 [05:27:47] ACKNOWLEDGEMENT - Host cr2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244) Ayounsi https://phabricator.wikimedia.org/T162239 [05:27:47] ACKNOWLEDGEMENT - Host cr2-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::3) Ayounsi https://phabricator.wikimedia.org/T162239 [05:31:35] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:32:05] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:40:21] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3162490 (10Papaul) @Gehel @RobH I spoke again yesterday with the HP Engineer that did help me on the lvs2002(T162099) issue about this case an... [05:42:25] (03Abandoned) 10Giuseppe Lavagetto: Add phase-9 varnish puppet run to restore order to dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/346310 (owner: 10Giuseppe Lavagetto) [05:45:32] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10Joe) >>! In T149006#3162490, @Papaul wrote: > @Gehel @RobH I spoke again yesterday with the HP Engineer that did help me on the lvs... 
[05:47:05] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:00:35] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:00:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346945 (https://phabricator.wikimedia.org/T160390) [06:01:05] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:02:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346945 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:04:07] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346945 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:04:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346945 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [06:05:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 - T160390 (duration: 00m 49s) [06:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:18] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [06:06:07] !log Deploy schema change db1094 (s7) - T160390 [06:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:51] (03PS6) 10Marostegui: mariadb: Disable trx, sync_binlog for tempdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/346773 (https://phabricator.wikimedia.org/T162290) [06:09:45] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3162515 (10Marostegui) [06:09:56] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 
first 10 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Marostegui) >>! In T162159#3160873, @Papaul wrote: > @Marostegui Just for your information > > asw-a2-codfw > asw-a7-codfw > asw-b2-codfw > asw-b7-codfw > as... [06:19:08] (03PS1) 10Giuseppe Lavagetto: Fix the --task switch functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346947 [06:19:19] (03PS1) 10Marostegui: db-codfw.php: Add tempdb2001 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) [06:20:27] (03CR) 10Marostegui: [C: 04-2] "Do not deploy until Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [06:20:49] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3162523 (10Marostegui) After disabling sync binlog and trx commit yesterday the server caught up. I have enabled gtid as well. I have sent the patch to pool it, but I think we should... [07:02:12] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3162530 (10MoritzMuehlenhoff) This seems fixed, it hasn't deadlocked with live traffic for approx 18 hours \o/ CPU and memory consumption are also stable.... [07:03:14] 06Operations, 10Monitoring, 07HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3162535 (10MoritzMuehlenhoff) [07:04:43] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3162536 (10Joe) @MoritzMuehlenhoff I'm ok with a wider but limited rollout, but at least until the dc switchover and rollback are done, I'd prefer to stick... 
[07:09:33] (03PS1) 10ArielGlenn: fix min pages per rev count query [dumps] - 10https://gerrit.wikimedia.org/r/346951 [07:10:54] (03PS2) 10ArielGlenn: fix min pages per rev count query [dumps] - 10https://gerrit.wikimedia.org/r/346951 [07:11:15] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:14:00] (03CR) 10ArielGlenn: [C: 032] fix min pages per rev count query [dumps] - 10https://gerrit.wikimedia.org/r/346951 (owner: 10ArielGlenn) [07:16:48] !log ariel@tin Started deploy [dumps/dumps@af61d8d]: handle page range generation for wikis with hundreds of thousands of revisions [07:16:52] !log ariel@tin Finished deploy [dumps/dumps@af61d8d]: handle page range generation for wikis with hundreds of thousands of revisions (duration: 00m 03s) [07:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:19] !log ariel@tin Started deploy [dumps/dumps@af61d8d]: I mean: handle page range generation for wikis with PAGES with hundreds of thousands of revisions [07:17:21] !log ariel@tin Finished deploy [dumps/dumps@af61d8d]: I mean: handle page range generation for wikis with PAGES with hundreds of thousands of revisions (duration: 00m 02s) [07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:33] (03PS1) 10Gehel: postgresql - clean up of python / bash code after volans review [puppet] - 10https://gerrit.wikimedia.org/r/346952 [07:20:52] (03CR) 10Gehel: "Corrections done in a new change https://gerrit.wikimedia.org/r/#/c/346952/" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346724 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [07:21:45] !log reimporting several damaged db tables on s2 T154485 [07:21:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:52] T154485: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485 [07:32:51] 06Operations, 07HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#3162572 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff I'm quite confident that this is the same bug as T161684 (which also turned out to be stat_cache related after some investigation) which has now been fixed... [07:37:36] (03PS2) 10Giuseppe Lavagetto: Fix the --task switch functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346947 [07:37:38] (03PS1) 10Giuseppe Lavagetto: Initial stub of a dry_run functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346953 [07:37:53] (03CR) 10jerkins-bot: [V: 04-1] Initial stub of a dry_run functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346953 (owner: 10Giuseppe Lavagetto) [07:39:06] (03PS2) 10Giuseppe Lavagetto: Initial stub of a dry_run functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346953 [07:39:15] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:42:43] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: setup tempdb2001(WMF6407) - https://phabricator.wikimedia.org/T162290#3162594 (10Marostegui) host added to tendril too. 
[07:47:26] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346954 [07:47:35] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [07:49:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346954 (owner: 10Marostegui) [07:50:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346954 (owner: 10Marostegui) [07:50:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346954 (owner: 10Marostegui) [07:50:53] !log Deploy  schema change db1039 (already depooled) (s7) - T160390 [07:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:01] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [07:51:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 - T160390 (duration: 00m 50s) [07:51:41] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=wdqs [07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:24] (03CR) 10Hoo man: "@legoktm: Good catch… serves me right, just two hours after complaining about the very same thing. :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346711 (owner: 10Hoo man) [07:52:36] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3162601 (10Gehel) Initial import is completed, wdqs-updater is restarted and is catching up on the differences since last export. [08:07:55] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:15:05] !log upgrade mw1262-mw1265 to HHVM 3.18.2 [08:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:41] 06Operations, 10hardware-requests, 15User-Joe: eqiad: 3 hardware request for etcd/zookeeper - https://phabricator.wikimedia.org/T162429#3162693 (10Joe) [08:17:33] 06Operations, 10hardware-requests, 15User-Joe: eqiad: 3 hardware request for etcd/zookeeper - https://phabricator.wikimedia.org/T162429#3162693 (10Joe) [08:28:15] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [08:28:25] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [08:28:26] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [08:28:26] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [08:28:40] nice! Data loaded :) [08:28:43] (03CR) 10Muehlenhoff: [C: 031] "Looks good, for paged search requests that seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/346790 (owner: 10Andrew Bogott) [08:29:15] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [08:29:15] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [08:33:16] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3162764 (10akosiaris) Agreed. [08:35:55] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:36:00] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/346947 (owner: 10Giuseppe Lavagetto) [08:36:44] 06Operations, 10netops: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3162767 (10ayounsi) 05Open>03Resolved From Vancis: > I've traced the whole route coming from L3 towards your cabinet. > The port is now up. 
The issue was the XFP, I've reseated the optic and the port... [08:41:36] (03CR) 10Volans: [C: 04-1] "#no-global" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346953 (owner: 10Giuseppe Lavagetto) [08:41:45] (03CR) 10Jcrespo: [C: 031] "OK. For the actual failover, we may want a 1:10 or 1:5 ratio." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346948 (https://phabricator.wikimedia.org/T162290) (owner: 10Marostegui) [08:54:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] etherpad: convert to profile/role structure (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [09:00:35] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:03:05] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:07:04] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3162783 (10Beetlebeard) Thank you!! [09:07:14] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3162784 (10Beetlebeard) 05Open>03Resolved [09:12:35] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:21:35] (03CR) 10Alexandros Kosiaris: [C: 031] osm - remove dead code [puppet] - 10https://gerrit.wikimedia.org/r/345861 (owner: 10Gehel) [09:22:01] !log Deploy  schema change db1062 (already depooled) (s7) - T160390 [09:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:10] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [09:22:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think this can be abandoned now that RESTBase moves over to scap3" [puppet] - 10https://gerrit.wikimedia.org/r/229306 (https://phabricator.wikimedia.org/T107532) (owner: 10GWicke) [09:24:42] (03PS2) 10Gehel: osm - remove dead code [puppet] - 10https://gerrit.wikimedia.org/r/345861 [09:27:29] (03CR) 10Gehel: [C: 032] osm - remove dead code [puppet] - 10https://gerrit.wikimedia.org/r/345861 (owner: 10Gehel) [09:29:35] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:33:05] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:33:21] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix the --task switch functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346947 (owner: 10Giuseppe Lavagetto) [09:41:35] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:42:52] 06Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3162873 (10MoritzMuehlenhoff) [09:43:55] 06Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#386350 (10MoritzMuehlenhoff) I'll prepare a Pango build to test this on jessie [09:51:25] PROBLEM - puppet last run on 
seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:56:09] (03PS3) 10Giuseppe Lavagetto: Initial stub of a dry_run functionality [switchdc] - 10https://gerrit.wikimedia.org/r/346953 [10:00:55] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:01:55] RECOVERY - DPKG on mw1262 is OK: All packages OK [10:07:48] (03PS2) 10Alexandros Kosiaris: Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 [10:07:55] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:09:30] (03PS1) 10Gehel: postgresql - introduce the check-postgres package for postgres monitoring [puppet] - 10https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) [10:09:33] (03PS1) 10Gehel: [WIP] postgresql - use the check-postgres package for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/346963 (https://phabricator.wikimedia.org/T162345) [10:09:34] (03PS1) 10Gehel: postgresql - cleanup dead code after migration to check-postgres package [puppet] - 10https://gerrit.wikimedia.org/r/346964 (https://phabricator.wikimedia.org/T162345) [10:10:25] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:15:53] (03PS1) 10Sfic: Set wgTranslateNumerals false on bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346965 (https://phabricator.wikimedia.org/T160098) [10:19:08] (03PS1) 10WMDE-Fisch: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) [10:19:25] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:23:09] (03CR) 10Addshore: [C: 04-1] Enable alternate RevisionSlider slider on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [10:25:23] (03CR) 10Muehlenhoff: [C: 031] "That's fine. But IMO the access::groups host definitions in Hiera should be part of this patch from the start. Otherwise the scope/impact " [puppet] - 10https://gerrit.wikimedia.org/r/346838 (https://phabricator.wikimedia.org/T162404) (owner: 10Rush) [10:35:04] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql - introduce the check-postgres package for postgres monitoring [puppet] - 10https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) (owner: 10Gehel) [10:35:55] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [10:36:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [10:37:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [10:38:25] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:41:55] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time 
[10:42:05] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [10:42:05] PROBLEM - HHVM processes on mw1263 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [10:42:35] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [10:42:55] ^ it's depooled, will silence [10:44:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:44:05] RECOVERY - HHVM processes on mw1263 is OK: PROCS OK: 6 processes with command name hhvm [10:46:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:47:35] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.065 second response time [10:47:55] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.471 second response time [10:48:05] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 78893 bytes in 5.314 second response time [10:50:54] (03PS1) 10Volans: Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) [10:52:49] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3163049 (10elukey) Discovered one thing today (that might be probably be trivial but I... 
[10:53:51] !log increase Redis connection timeout manually (.3s -> .5s) on mw1306 as performance test - T125735 [10:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:58] T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735 [11:00:16] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [11:00:37] (03PS2) 10Aklapper: Link to Code of Conduct from Phabricator's footer [puppet] - 10https://gerrit.wikimedia.org/r/343749 [11:08:08] (03PS10) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [11:13:46] (03CR) 10Volans: [C: 04-1] "See comments inline, also I'm aware that the .py will be rewritten/removed" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346952 (owner: 10Gehel) [11:18:12] (03CR) 10Giuseppe Lavagetto: "It's less horrible and invasive than we expected." 
(038 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:20:27] 06Operations, 06Release-Engineering-Team, 05Goal, 06Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3163120 (10MoritzMuehlenhoff) p:05Triage>03High [11:20:37] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3163135 (10MoritzMuehlenhoff) p:05Triage>03High [11:20:44] 06Operations, 05Goal, 07kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3163137 (10MoritzMuehlenhoff) p:05Triage>03High [11:20:50] 06Operations, 05Goal, 07kubernetes: Define a process to keep images up-to-date on similar standards as the rest of production - https://phabricator.wikimedia.org/T162043#3163138 (10MoritzMuehlenhoff) p:05Triage>03High [11:20:51] (03PS3) 10Alexandros Kosiaris: Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 [11:20:57] 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. 
(stretch) - https://phabricator.wikimedia.org/T162045#3163139 (10MoritzMuehlenhoff) p:05Triage>03High [11:29:07] (03PS2) 10Gehel: postgresql - introduce the check-postgres package for postgres monitoring [puppet] - 10https://gerrit.wikimedia.org/r/346962 (https://phabricator.wikimedia.org/T162345) [11:39:09] (03PS2) 10WMDE-Fisch: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) [11:39:14] (03CR) 10WMDE-Fisch: Enable alternate RevisionSlider slider on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [11:47:06] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3163543 (10elukey) I tried to manually modify `/srv/mediawiki/wmf-config/jobqueue.php`... [11:58:30] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3163583 (10Dvorapa) For me every day new images are broken and those br... 
[12:03:03] (03PS2) 10Volans: Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) [12:03:23] (03CR) 10jerkins-bot: [V: 04-1] Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:04:47] (03PS3) 10Volans: Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) [12:06:24] (03CR) 10Volans: "replies inline" (036 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:21:35] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:53] (03Abandoned) 10Elukey: Increase Redis connection timeout for MediaWiki Jobrunners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346508 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [12:25:28] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3163609 (10elukey) So the new hosts are in puppet with role memcached, and since we are very close to the switchover it might be better to skip the switch when... [12:27:25] (03CR) 10Giuseppe Lavagetto: Add dry-run mode and uses it (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:28:33] (03PS3) 10WMDE-Fisch: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) [12:31:25] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:37:36] (03CR) 10Addshore: [C: 032] Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [12:38:02] (03CR) 10Addshore: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [12:44:01] (03PS4) 10WMDE-Fisch: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) [12:44:55] (03CR) 10Addshore: [C: 032] Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [12:45:30] !log banning cache_upload obj.http.Content-type ~ text [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:35] !log banning cache_upload obj.http.Content-type == text/html [12:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:59] hmm that's not quite right either [12:47:12] (what I realized after the first one is these bans might be catching SVGs, too) [12:48:32] !log banning cache_upload obj.http.Content-type ~ text/html [12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:17] (03Merged) 10jenkins-bot: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [12:49:26] (03CR) 10jenkins-bot: Enable alternate RevisionSlider slider on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346966 (https://phabricator.wikimedia.org/T160410) (owner: 10WMDE-Fisch) [12:50:35] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 14 
seconds ago with 0 failures [12:52:05] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:346966|Enable alternate RevisionSlider slider on beta]] BETA ONLY (duration: 00m 51s) [12:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:12] (03PS2) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [12:55:12] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 (owner: 10BBlack) [12:56:00] 06Operations: Reimage achernar and amacar to jessie - https://phabricator.wikimedia.org/T155411#3163656 (10BBlack) p:05Normal>03High This is going to block deploying edns-client-subnet -enabled recdns packages (requires jessie), which is important for the DC switching stuff. Perhaps we can squeeze this in n... [12:56:50] (03PS4) 10Volans: Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) [12:57:30] (03PS3) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [12:58:01] (03CR) 10Volans: "see inline" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:58:21] 06Operations: Reimage achernar and amacar to jessie - https://phabricator.wikimedia.org/T155411#3163678 (10MoritzMuehlenhoff) What's the status of T154759? 
The last we had a DNS recursor down, this led to various problems ( I don't remember all the details, though) [13:00:25] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:00:44] (03CR) 10Giuseppe Lavagetto: [C: 031] Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:04:21] 06Operations, 06Analytics-Kanban, 10netops, 15User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3163723 (10elukey) [13:04:57] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [13:05:43] (03CR) 10jerkins-bot: [V: 04-1] apertium-spa-cat: New upstream release [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [13:06:02] hashar: the magic jerkins bot says no :D [13:06:12] (03CR) 10Volans: [C: 032] Add dry-run mode and uses it [switchdc] - 10https://gerrit.wikimedia.org/r/346968 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:07:29] (03CR) 10Hashar: "Fails with:" [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [13:08:53] 06Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3163748 (10MoritzMuehlenhoff) A jessie backport of Pango 1.40.4 for jessie is now available at https://people.wikimedia.org/~jmm/pango/ [13:10:07] (03PS4) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [13:10:09] (03PS1) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 
10https://gerrit.wikimedia.org/r/346980 [13:12:07] (03Abandoned) 10BBlack: cp1008: test recdns rather than authdns for now [puppet] - 10https://gerrit.wikimedia.org/r/346821 (owner: 10BBlack) [13:28:28] (03PS5) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [13:28:30] (03PS2) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 [13:29:39] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 (owner: 10BBlack) [13:29:46] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 (owner: 10BBlack) [13:32:28] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3163838 (10MoritzMuehlenhoff) Ok, thanks for the explanation! The problem is somewhere in "Google Cloud Directory Sync", then. It appears as if moving a user to a diff... [13:33:55] 06Operations, 06Labs: create a 'root' group strictly for labs/cloud services infrastructure - https://phabricator.wikimedia.org/T162404#3163839 (10chasemp) @MoritzMuehlenhoff (quoting you here just because gerrit sucks for these things) > That's fine. But IMO the access::groups host definitions in Hiera shou... [13:34:45] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 599903 [13:34:49] 06Operations, 10netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3163841 (10elukey) ``` 14:49 not sure if this makes any sense but I did the following 14:49 mtr tegmen.wikimedia.org from netmon1001 14:50 (one of the t... 
[13:36:33] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3163844 (10MoritzMuehlenhoff) >>! In T161004#3152956, @bbogaert wrote: > > Sometimes mail is forwarded to another address, sometimes it is shut off completely. Most of the time we... [13:37:14] (03PS6) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [13:37:17] (03PS3) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 [13:37:55] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 583932 [13:40:05] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 723745 [13:43:08] (03PS1) 10Elukey: Add explicit Xmx and Xms settings to Hadoop MRHS and Namenode [puppet] - 10https://gerrit.wikimedia.org/r/346983 (https://phabricator.wikimedia.org/T159219) [13:43:56] (03PS1) 10Andrew Bogott: fullstack: Switch to the normal jessie image [puppet] - 10https://gerrit.wikimedia.org/r/346984 [13:44:35] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3090258 keys, up 14 days 21 hours - replication_delay is 630 [13:45:02] (03CR) 10jerkins-bot: [V: 04-1] Add explicit Xmx and Xms settings to Hadoop MRHS and Namenode [puppet] - 10https://gerrit.wikimedia.org/r/346983 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey) [13:48:15] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 582004 [13:49:45] (03PS2) 10Elukey: Add explicit Xmx and Xms settings to Hadoop MRHS and Namenode [puppet] - 10https://gerrit.wikimedia.org/r/346983 (https://phabricator.wikimedia.org/T159219) [13:52:16] 06Operations: Reimage achernar and amacar to jessie - 
https://phabricator.wikimedia.org/T155411#3163902 (10BBlack) >>! In T155411#3163678, @MoritzMuehlenhoff wrote: > What's the status of T154759? The last we had a DNS recursor down, this led to various problems ( I don't remember all the details, though) I thi... [13:52:36] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3090154 keys, up 14 days 21 hours - replication_delay is 0 [13:54:22] 06Operations: Reimage achernar and amacar to jessie - https://phabricator.wikimedia.org/T155411#3163907 (10MoritzMuehlenhoff) Sure, we can just give it a try. I'll look into it next week. [13:55:13] (03PS4) 10Alexandros Kosiaris: Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 [13:55:15] (03PS1) 10Alexandros Kosiaris: Make role::backup::director a profile [puppet] - 10https://gerrit.wikimedia.org/r/346992 [13:57:55] (03PS1) 10Alexandros Kosiaris: Add profile::backup::director::dbpass dummy pass [labs/private] - 10https://gerrit.wikimedia.org/r/346993 [13:58:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add profile::backup::director::dbpass dummy pass [labs/private] - 10https://gerrit.wikimedia.org/r/346993 (owner: 10Alexandros Kosiaris) [13:58:45] (03PS1) 10Giuseppe Lavagetto: Add dry-run to redis, remove unneeded puppet actions [switchdc] - 10https://gerrit.wikimedia.org/r/346994 [13:58:51] I like how one bot got the submit and the other bot got the upload [13:59:10] how do we restart this thing again ? according to wikitech it still on SGE ? [13:59:33] but there is a project in tool labs kubernetes with a pod running ? [13:59:36] what the ... 
[14:01:22] !log switchdc (oblivian@sarin) Executing task switchdc.stages.t00_disable_puppet(eqiad, codfw): Stop puppet execution on maintenance, jobqueues [14:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:42] <_joe_> oh sorry [14:01:55] <_joe_> !log running tests of the switchdc automation in dry-run mode [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:05] <_joe_> volans: we might want to change that too [14:02:09] (03CR) 10Elukey: [C: 032] Add explicit Xmx and Xms settings to Hadoop MRHS and Namenode [puppet] - 10https://gerrit.wikimedia.org/r/346983 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey) [14:02:18] <_joe_> no reason to !log if we're running a dry-run [14:02:23] !log switchdc (oblivian@sarin) Executing task switchdc.stages.t00_reduce_ttl(eqiad, codfw): Reduce the TTL of all the MediaWiki discovery records [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:35] _joe_: sure, let me fix it [14:04:25] 06Operations, 10Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3163924 (10cwdent) 05Resolved>03Open @robh - I finally got around to trying to ack something but still got "Not authorized" logged in as cdentinger - not cwdent as my login in is here and a couple other... 
[14:05:34] (03PS1) 10Volans: Dry-run: do not notify IRC/SAL [switchdc] - 10https://gerrit.wikimedia.org/r/346999 (https://phabricator.wikimedia.org/T160178) [14:05:57] _joe_: ^^^ [14:06:44] (03CR) 10Giuseppe Lavagetto: [C: 032] Dry-run: do not notify IRC/SAL [switchdc] - 10https://gerrit.wikimedia.org/r/346999 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:07:34] !log restart hadoop-mapreduce-historyserver on an1001 to pick up the new jvm settings [14:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:25] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:39] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review, and 2 others: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3163940 (10Halfak) Hey folks. I just realized that I had the following message in sitting in ph... 
[14:13:42] !log restart hadoop-hdfs-namenode on an1002 (Hadoop Master standby) to pick up new jvm settings [14:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:43] (03PS7) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [14:18:45] (03PS4) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 [14:19:17] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3163952 (10MoritzMuehlenhoff) [14:20:11] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 (owner: 10BBlack) [14:20:11] 06Operations, 10netops: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3163967 (10ayounsi) Attaching extra informations from Vancis staff about the X-connect: {F7344302} [14:20:12] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 (owner: 10BBlack) [14:23:35] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:27:24] (03PS8) 10BBlack: dnsrecursor: 4.x backport and edns-client-subnet [puppet] - 10https://gerrit.wikimedia.org/r/346937 [14:27:26] (03PS5) 10BBlack: dnsrecursor: update to backports for transition [puppet] - 10https://gerrit.wikimedia.org/r/346980 [14:32:31] (03CR) 10KartikMistry: "> Fails with:" [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/346780 (https://phabricator.wikimedia.org/T161511) (owner: 10KartikMistry) [14:34:13] (03CR) 10BBlack: [C: 031] "This looks like it will work in manual testing + compiler outputs on trusty/jessie. 
Needs careful deployment together with the followup p" [puppet] - 10https://gerrit.wikimedia.org/r/346937 (owner: 10BBlack) [14:37:05] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3164009 (10elukey) [14:37:15] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:25] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:40:00] !log Deploy schema change db1033 (already depooled) (s7) - T160390 [14:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:08] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [14:40:31] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3164016 (10elukey) Status: * All worker nodes except analytics1030 (down for hw failures) have Debian Jessie * Some worker nodes needs to be rebooted to pick up the Linux... 
[14:41:21] (03PS1) 10Alexandros Kosiaris: Make role::backup::storage a profile [puppet] - 10https://gerrit.wikimedia.org/r/347003 [14:51:35] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:52:22] (03PS5) 10Alexandros Kosiaris: Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 [14:52:24] (03PS2) 10Alexandros Kosiaris: Make role::backup::director a profile [puppet] - 10https://gerrit.wikimedia.org/r/346992 [14:52:26] (03PS2) 10Alexandros Kosiaris: Make role::backup::storage a profile [puppet] - 10https://gerrit.wikimedia.org/r/347003 [14:53:02] (03CR) 10Alexandros Kosiaris: [C: 032] "a cross fleet PCC seems ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/346732 (owner: 10Alexandros Kosiaris) [14:53:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Move role::backup::host into a profile [puppet] - 10https://gerrit.wikimedia.org/r/346732 (owner: 10Alexandros Kosiaris) [14:53:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make role::backup::director a profile [puppet] - 10https://gerrit.wikimedia.org/r/346992 (owner: 10Alexandros Kosiaris) [14:53:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make role::backup::storage a profile [puppet] - 10https://gerrit.wikimedia.org/r/347003 (owner: 10Alexandros Kosiaris) [14:55:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:58:11] RECOVERY - Check Varnish expiry mailbox lag on cp1063 is OK: OK: expiry mailbox lag is 0 [14:58:22] !log Deploy schema change dbstore1001 (s7 wikis) - T160390 [14:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [14:59:37] 06Operations, 10ops-eqiad: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3164039 (10Cmjohnson) 
@ArielGlenn ms1003 is now wiped and removed from rack destroy any and all things! Thanks [15:00:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:03:31] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:28] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3164054 (10Nuria) [15:04:42] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:05:01] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:05:31] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:05:38] (03PS1) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [15:06:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (owner: 10Gehel) [15:08:51] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:19] (03PS2) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [15:09:51] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:10:31] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:15:11] (03PS1) 10Alexandros Kosiaris: Fix typo with role::backup::host [puppet] - 10https://gerrit.wikimedia.org/r/347007 [15:16:41] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:19:41] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:20:41] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:31] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:22:29] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review, and 2 others: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3164081 (10mobrovac) What is the status of {T148714} ? Has it been deployed and tested? If so, w... [15:23:40] (03PS1) 10Alexandros Kosiaris: Include profile::backup::host in respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347008 [15:23:41] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:24:32] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo with role::backup::host [puppet] - 10https://gerrit.wikimedia.org/r/347007 (owner: 10Alexandros Kosiaris) [15:24:39] (03CR) 10Alexandros Kosiaris: [C: 032] Include profile::backup::host in respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347008 (owner: 10Alexandros Kosiaris) [15:24:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Include profile::backup::host in respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347008 (owner: 10Alexandros Kosiaris) [15:25:11] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:41] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:31] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:26:41] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:21] (03PS2) 10Giuseppe Lavagetto: Add dry-run to redis, remove unneeded puppet actions [switchdc] - 10https://gerrit.wikimedia.org/r/346994 [15:29:23] (03PS1) 10Giuseppe Lavagetto: A few bugfixes [switchdc] - 10https://gerrit.wikimedia.org/r/347009 [15:29:44] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2739503 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... [15:29:47] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3164088 (10bd808) >>! 
In T125735#3163049, @elukey wrote: > So if I got it correctly, a... [15:31:41] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:32:17] !log reimaging elstic2020 - T149006 [15:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:24] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [15:32:41] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:34:30] thanks bd808 :) [15:34:48] I put some thoughts in the task, let me know what you think about them [15:34:57] I'll try to review also the hhvm logs [15:37:51] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:37:51] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:45:41] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:47:02] (03CR) 10Mobrovac: [C: 04-1] "Also needs a line in hieradata/labs/deployment-prep/common.yaml - https://github.com/wikimedia/puppet/blob/a25081633a5573744ab33503527f52e" [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [15:48:41] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:49:31] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:53:41] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:54:11] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:54:54] 06Operations, 10ops-eqiad, 
13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3164110 (10MoritzMuehlenhoff) That makes sense, but I'm wondering if we can replace one of the hosts prior to the switchover, so that we test it with Linux 4.9... [15:56:11] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:20] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3164111 (10Andrew) p:05Triage>03Normal a:03Andrew [15:57:21] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3164113 (10elukey) Yes we definitely can, the procedure is written above and we can target mc1019 (decommissioning mc1001). Maybe we can schedule it for next... [16:07:51] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 10202 [16:10:46] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3164153 (10MoritzMuehlenhoff) Sounds good to me! [16:12:23] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3164156 (10Marostegui) [16:14:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 844.85 seconds [16:14:44] checking [16:18:55] categorylinks activity, maybe? 
[16:19:02] it was the analyst I think [16:19:06] I am writing him [16:19:08] nope [16:19:27] nothing seems broken, so at least there is that [16:19:52] because I killed his query :-) [16:20:04] he sent me an email an hour ago with that same query [16:20:06] :-) [16:20:22] dbstore1002 lag is low prio [16:20:35] it would never end :) [16:20:54] better if it is solved, but analytics queries normally do queries on the last days or weeks [16:21:14] he was reaching max_allowed_packet too this time [16:21:15] ok, if it is infinite lag, yes, not good :-) [16:21:32] yeah, lag would be infinite this case XD [16:21:43] I will handle this new query with him [16:22:11] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:24:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 216.78 seconds [16:28:11] (03PS1) 10Alexandros Kosiaris: Fix role::mediawiki::maintenance backup::host inclusion [puppet] - 10https://gerrit.wikimedia.org/r/347016 [16:29:14] !log demon@tin Synchronized php-1.29.0-wmf.19/extensions/SyntaxHighlight_GeSHi/: no-op, cleaning up history (duration: 00m 44s) [16:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:45] akosiaris: ooh, you replaced all the role::backup::host includes already. cool, i was always wondering where to put it, this way it's clear now [16:39:21] (03PS1) 10Cmjohnson: Adding dhcpd entries for analytics 1058-1068, 1068 mac address is not defined yet. Updating netboot.cfg file for installs T162216 [puppet] - 10https://gerrit.wikimedia.org/r/347020 [16:39:24] mutante: yeah I started doing some bacula changes for retention and schedules [16:39:39] and then realized the entire structure of the thing required some changes to make it up to date [16:39:42] (03CR) 10Dzahn: "include ::profile::backup::host with leading :: ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/347016 (owner: 10Alexandros Kosiaris) [16:40:04] akosiaris: ^ do we always want the :: ? [16:40:28] it's a bit mixed [16:40:53] yeah I've been thinking about that [16:41:08] i have been adding it in a couple cases [16:41:16] we had a volunteer trying to get that throughout the repo [16:41:34] maybe it just makes more sense [16:41:35] JuniorSys? [16:41:38] I'll amend [16:41:38] i merged stuff there [16:41:41] yes [16:41:48] i think i merged his stuff but there is just more [16:42:10] well, without a tool in CI enforcing that behavior [16:42:15] there isn't much point [16:42:23] we can have that tool [16:42:27] things are always gonna deteriorate after a cleanup [16:42:29] by adding a regex to "typos" [16:42:50] That, or teaching the linter/stylechecker to look at it [16:42:57] (03PS2) 10Alexandros Kosiaris: Fix role::mediawiki::maintenance backup::host inclusion [puppet] - 10https://gerrit.wikimedia.org/r/347016 [16:42:59] Either is doable, typos regex probably easier [16:43:04] a regex that matches "include" NOT followed by " ::" [16:43:04] I hadn't thought of the "typos" approach [16:43:14] it's a neat idea [16:43:36] I would support it [16:43:40] this made me think of it https://gerrit.wikimedia.org/r/#/c/346677/ [16:43:42] akosiaris: It's how we also keep people from putting invalid ranges of mw* servers. So you can't write mw2001.eqiad.wmnet (since 2001 would be in codfw) [16:43:43] :) [16:44:03] :-) [16:44:05] Er it might be all number ranges, not just mw* either [16:44:30] (? RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 12051 [16:45:15] (03CR) 10Alexandros Kosiaris: [C: 032] "thanks for the comment! Addressed and merging" [puppet] - 10https://gerrit.wikimedia.org/r/347016 (owner: 10Alexandros Kosiaris) [16:45:29] :) gotta get some breakfast. cu [16:46:09] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for analytics 1058-1068, 1068 mac address is not defined yet.
Updating netboot.cfg file for installs T162216 [puppet] - 10https://gerrit.wikimedia.org/r/347020 (owner: 10Cmjohnson) [16:46:20] (03PS2) 10Cmjohnson: Adding dhcpd entries for analytics 1058-1068, 1068 mac address is not defined yet. Updating netboot.cfg file for installs T162216 [puppet] - 10https://gerrit.wikimedia.org/r/347020 [16:48:32] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding dhcpd entries for analytics 1058-1068, 1068 mac address is not defined yet. Updating netboot.cfg file for installs T162216 [puppet] - 10https://gerrit.wikimedia.org/r/347020 (owner: 10Cmjohnson) [16:48:41] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:50:10] cmjohnson1: \o/ [16:51:27] elukey: raid 1 on the ssds and individual raid 0 on the 12 spinning disks ...right? [16:51:41] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:51:54] !log demon@tin Started scap: no-op, cleaning up wmf.19 history [16:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] cmjohnson1: yep yep same recipe as the others [16:56:52] cool [16:56:58] thank you [16:57:11] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3164199 (10Nemo_bis) >>! In T160529#3162334, @KTC wrote: > AFAIK, from the list members email server point of view, any SPF check will pass since it's checking WMF's mailman server. Indeed, see... 
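Editorial note: the "typos" regex idea from the 16:42-16:43 exchange (match "include" NOT followed by " ::") could look roughly like the sketch below. It is only an illustration in Python, assuming the regex engine in use supports negative lookahead; the pattern, the `find_unqualified_includes` helper and the sample manifest are hypothetical and are not taken from the puppet repo or its typos file.

```python
import re

# Hypothetical pattern: an "include" statement whose class name does not
# start with "::". The negative lookahead (?!::) rejects qualified names.
UNQUALIFIED_INCLUDE = re.compile(
    r"^\s*include\s+(?!::)([A-Za-z_][\w:]*)", re.MULTILINE
)


def find_unqualified_includes(manifest: str) -> list:
    """Return the class names that are included without a leading '::'."""
    return UNQUALIFIED_INCLUDE.findall(manifest)


# Made-up manifest used only to exercise the pattern.
sample = """
class role::example {
    include ::profile::backup::host
    include profile::backup::host
    include nrpe
}
"""

print(find_unqualified_includes(sample))
```

A check like this only helps if CI runs it on every change, which is the point made in the conversation about cleanups deteriorating without enforcement.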
[16:58:04] (03PS2) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) [17:03:26] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup first 10 DB servers - https://phabricator.wikimedia.org/T162159#3164204 (10Papaul) [17:05:20] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3164217 (10Andrew) This is from the pinning in https://gerrit.wikimedia.org/r/#/c/300870/ -- jessie-backports has upgraded their puppet package which causes conflicts. [17:09:03] (03CR) 10Volans: [C: 031] "Look sane to me." [switchdc] - 10https://gerrit.wikimedia.org/r/346994 (owner: 10Giuseppe Lavagetto) [17:14:43] (03PS1) 10Dzahn: standardize on include ::profile::backup::host [puppet] - 10https://gerrit.wikimedia.org/r/347022 [17:15:24] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3164236 (10Andrew) We can add 3.8 packages to reprepro but that will affect puppet clients as well as masters, which we might not want. [17:17:01] !log demon@tin Finished scap: no-op, cleaning up wmf.19 history (duration: 25m 07s) [17:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:35] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3164242 (10faidon) My suggestion, which needs a little more time to be fully tested is: - Take the latest 3.8 jessie-backport (from snapshot.debian.org), 3.8.5-2~bpo8+1, and put it in... [17:21:33] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3164263 (10Joe) Our production puppetmasters run on 3.8, several clients have been tested, and the agent should have minimal differences. 
I can take a look back at the changelog for... [17:25:33] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/347009 (owner: 10Giuseppe Lavagetto) [17:26:18] (03PS1) 10Dzahn: standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 [17:30:35] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3164332 (10dr0ptp4kt) That did the trick. Thanks! [17:30:43] (03PS1) 10Dzahn: standardize "include ::base:*" [puppet] - 10https://gerrit.wikimedia.org/r/347024 [17:33:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Add dry-run to redis, remove unneeded puppet actions [switchdc] - 10https://gerrit.wikimedia.org/r/346994 (owner: 10Giuseppe Lavagetto) [17:36:18] (03CR) 10Giuseppe Lavagetto: [C: 032] A few bugfixes [switchdc] - 10https://gerrit.wikimedia.org/r/347009 (owner: 10Giuseppe Lavagetto) [17:36:22] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3164345 (10RobH) [17:43:29] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3164368 (10Jgreen) [17:57:51] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:07:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:16:14] !log demon@tin Synchronized php-1.29.0-wmf.19/includes/api/: No-op, cleaning up git history (duration: 00m 54s) [18:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:28] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3164407 (10Dzahn) 05Open>03Resolved 
a:03Dzahn ok, cool :) thanks for confirming and your patience. resolving ticket [18:19:58] (03Abandoned) 10Niharika29: Disable logins on loginwiki to support LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [18:25:45] 06Operations, 10Ops-Access-Requests: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3071446 (10Dzahn) In LDAP there is just 1 user "cwdent" but it has a **uid** "cwdent", with **sn** and **cn** "Cdentinger". The Icinga contact name is cwdent but i think it's the sn it has to match. [18:39:26] (03CR) 10Dzahn: etherpad: convert to profile/role structure (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [18:40:07] (03PS2) 10Dzahn: etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 [18:40:38] !log demon@tin Synchronized php-1.29.0-wmf.19/includes/specials/: no-op, cleaning up history (duration: 01m 00s) [18:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:22] (03CR) 10jerkins-bot: [V: 04-1] etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 (owner: 10Dzahn) [18:46:02] (03PS3) 10Dzahn: etherpad: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346921 [18:47:04] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3164498 (10dr0ptp4kt) Thank you as well for your patience. I forgot about that delegation distinction (assuming there has always been one) ! 
[18:53:11] (03PS3) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [19:00:01] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 81931 [19:03:01] (03CR) 10Mobrovac: [C: 04-1] changeprop: Add an ores_uris parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [19:04:30] (03PS1) 10Cmjohnson: Adding production dns for analytics1058-68 T162216 [dns] - 10https://gerrit.wikimedia.org/r/347042 [19:05:00] (03CR) 10Cmjohnson: [C: 032] Adding production dns for analytics1058-68 T162216 [dns] - 10https://gerrit.wikimedia.org/r/347042 (owner: 10Cmjohnson) [19:16:01] (03PS3) 10Catrope: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 [19:16:03] (03PS3) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki and etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 [19:16:06] (03PS1) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 [19:16:25] (03CR) 10Catrope: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 (owner: 10Catrope) [19:18:20] (03PS1) 10Ayounsi: Fix 404 errors for smokeping cropper [puppet] - 10https://gerrit.wikimedia.org/r/347046 [19:18:53] !log demon@tin Started scap: no-op, final history sync [19:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:45] (03CR) 10BBlack: [C: 031] Fix 404 errors for smokeping cropper [puppet] - 10https://gerrit.wikimedia.org/r/347046 (owner: 10Ayounsi) [19:21:17] (03CR) 10Ayounsi: [C: 032] Fix 404 errors for smokeping cropper [puppet] - 10https://gerrit.wikimedia.org/r/347046 (owner: 10Ayounsi) [19:21:31] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:29:22] (03PS1) 10Dzahn: nagios_common: rename cwdent to cdentinger [puppet] - 10https://gerrit.wikimedia.org/r/347050 (https://phabricator.wikimedia.org/T159564) [19:32:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Write access to Icinga - https://phabricator.wikimedia.org/T159564#3164684 (10Dzahn) a:05RobH>03Dzahn [19:33:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3071446 (10Dzahn) [19:33:54] (03PS2) 10Dzahn: nagios_common: rename cwdent to cdentinger [puppet] - 10https://gerrit.wikimedia.org/r/347050 (https://phabricator.wikimedia.org/T159564) [19:34:11] (03CR) 10Dzahn: [V: 032 C: 032] nagios_common: rename cwdent to cdentinger [puppet] - 10https://gerrit.wikimedia.org/r/347050 (https://phabricator.wikimedia.org/T159564) (owner: 10Dzahn) [19:34:39] (03CR) 10Dzahn: "merging together with change in private repo to avoid breaking Icinga config (renaming contact)" [puppet] - 10https://gerrit.wikimedia.org/r/347050 (https://phabricator.wikimedia.org/T159564) (owner: 10Dzahn) [19:40:32] (03PS1) 10Andrew Bogott: Revert "Nova dnsmasq: Reduce lease times and ttls by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/347052 [19:40:34] (03PS1) 10Andrew Bogott: nova.conf: Reduce fixed_ip_disassociate_timeout to three minutes. [puppet] - 10https://gerrit.wikimedia.org/r/347053 (https://phabricator.wikimedia.org/T160908) [19:40:36] (03PS1) 10Andrew Bogott: nova.conf: change dhcp lease times to 12 hours. [puppet] - 10https://gerrit.wikimedia.org/r/347054 (https://phabricator.wikimedia.org/T160908) [19:41:24] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3164730 (10Dzahn) @cwdent I renamed your Icinga contact to "cdentinger". 
Please log out of Icinga and login again, using "cdentinger" (not capitalized, it wo... [19:41:49] (03PS1) 10Chad: donatewiki back to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347056 (https://phabricator.wikimedia.org/T162300) [19:41:58] !log demon@tin Finished scap: no-op, final history sync (duration: 23m 05s) [19:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:10] (03CR) 10Chad: [C: 04-2] donatewiki back to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347056 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [19:42:40] 06Operations, 10Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3164736 (10Dzahn) [19:43:03] 06Operations, 10Ops-Access-Requests: Icinga contact/permissions for cwdent (cdentinger) - https://phabricator.wikimedia.org/T159564#3071446 (10Dzahn) p:05Triage>03Normal [19:44:15] (03PS1) 10Chad: Scap clean: Also delete empty directories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347057 [19:44:25] (03CR) 10Chad: "This may not even work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347057 (owner: 10Chad) [19:49:20] (03CR) 10Andrew Bogott: [C: 032] Revert "Nova dnsmasq: Reduce lease times and ttls by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/347052 (owner: 10Andrew Bogott) [19:50:29] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:03:02] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3164792 (10grin) >>! In T160529#3162334, @KTC wrote: >>>! In T160529#3161835, @grin wrote: >> Dropping/autorejecting email with matching header >> `​X-Spam-Score: .+\+\+\+\+\+` >> (which is abov... 
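Editorial note: the header pattern quoted in the task comment above (`X-Spam-Score: .+\+\+\+\+\+`) matches any score header whose value contains at least five consecutive plus signs. A minimal Python sketch of that test follows; the `looks_like_spam` helper and the sample header lines are made up for illustration, and the exact header format produced by the real mail servers is an assumption.

```python
import re

# The pattern as quoted in the task: some text, then five literal '+' signs.
SPAM_HEADER = re.compile(r"X-Spam-Score: .+\+\+\+\+\+")


def looks_like_spam(header_line: str) -> bool:
    """True if the header line carries a spam bar of five or more '+'."""
    return SPAM_HEADER.search(header_line) is not None


# Hypothetical header lines; real values depend on the mail server setup.
print(looks_like_spam("X-Spam-Score: 6.1 (++++++)"))
print(looks_like_spam("X-Spam-Score: 2.4 (++)"))
```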
[20:05:59] !log demon@tin Synchronized README: no-op, co-master sync (duration: 00m 39s) [20:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:15] !log demon@tin Synchronized README: no-op, testing master sync speed now (duration: 00m 38s) [20:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:45] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3164796 (10grin) >>! In T160529#3164199, @Nemo_bis wrote: >>>! In T160529#3162334, @KTC wrote: >> AFAIK, from the list members email server point of view, any SPF check will pass since it's chec... [20:13:42] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240511 (10grin) Just as a sidenote: be aware that wildcards are only wildcard **one** level up, not **any**; `*.wikimedia.org` matches robh.wikimedia.org but not server01.robh.wikimedia.org (... 
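Editorial note: the single-level wildcard rule described in the comment above can be illustrated with a small Python sketch. It is a simplification that assumes the wildcard appears only as the entire leftmost label; `wildcard_matches` is a hypothetical helper, not a full certificate hostname verifier.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Check a hostname against a certificate name with an optional
    leftmost '*' label that stands for exactly one DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard needs a non-empty label; the rest must match exactly.
        return h_labels[0] != "" and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


print(wildcard_matches("*.wikimedia.org", "robh.wikimedia.org"))
print(wildcard_matches("*.wikimedia.org", "server01.robh.wikimedia.org"))
```

The two calls mirror the hostnames used in the comment: the first has one label in place of the wildcard, the second has two.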
[20:18:50] (03PS1) 10Chad: Scap clean: Log to IRC when we prune a branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347061 [20:21:51] (03CR) 10Dzahn: [C: 032] standardize on include ::profile::backup::host [puppet] - 10https://gerrit.wikimedia.org/r/347022 (owner: 10Dzahn) [20:21:55] (03PS2) 10Dzahn: standardize on include ::profile::backup::host [puppet] - 10https://gerrit.wikimedia.org/r/347022 [20:24:56] 06Operations, 06Labs, 10procurement: eqiad: (2) hardware access request for labvirt1019 and labvirt1020 (refresh) - https://phabricator.wikimedia.org/T162486#3164839 (10chasemp) p:05Triage>03Normal a:03RobH [20:25:32] 06Operations, 06Labs, 10procurement: eqiad: (2) hardware access request for labvirt1019 and labvirt1020 (refresh) - https://phabricator.wikimedia.org/T162486#3164827 (10chasemp) [20:25:34] 06Operations, 10DBA, 06Labs: eqiad: (2) hardware access request for labsdb1006 & 7 refresh - https://phabricator.wikimedia.org/T161755#3164860 (10chasemp) [20:25:36] 06Operations, 10DBA, 06Labs: eqiad: (2) hardware access request for labsdb1004 & 5 refresh - https://phabricator.wikimedia.org/T161754#3164861 (10chasemp) [20:25:49] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:32] (03PS1) 10Dzahn: typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064 [20:29:07] (03CR) 10Dzahn: "follow-up: https://gerrit.wikimedia.org/r/#/c/347064/" [puppet] - 10https://gerrit.wikimedia.org/r/347022 (owner: 10Dzahn) [20:29:51] (03CR) 10jerkins-bot: [V: 04-1] typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [20:33:59] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:34:50] (03PS1) 10Dzahn: backup: add leading :: in usage example [puppet] - 10https://gerrit.wikimedia.org/r/347078 [20:34:59] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:35:12] (03CR) 10Dzahn: "needed one follow-up https://gerrit.wikimedia.org/r/347078 which i didn't catch because it wasn't in a .pp file" [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [20:35:39] (03CR) 10Dzahn: "+ https://gerrit.wikimedia.org/r/347078" [puppet] - 10https://gerrit.wikimedia.org/r/347022 (owner: 10Dzahn) [20:36:06] (03CR) 10Dzahn: [C: 032] backup: add leading :: in usage example [puppet] - 10https://gerrit.wikimedia.org/r/347078 (owner: 10Dzahn) [20:36:12] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3102305 (10Platonides) We should definitely reject emails failing SPF. Much less forward that to mailing lists (forwarding to list owners //might// be acceptable, although I wouldn't recommend t... [20:36:42] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [20:40:48] (03CR) 10Dzahn: "do we really want to remove mod_version completely? i think i would have kept that but removed all the rest this does" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [20:42:43] (03CR) 10Zppix: "Shouldnt the Apache 2.2 be apart of a separate patch just in case we need to revert one but not the other?" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [20:44:45] (03CR) 10Dzahn: "@Zppix it should all be the same thing. 
it's 2.4 in all other distros we used" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [20:45:39] (03CR) 10Dzahn: "(it's also 2.4 in trusty http://packages.ubuntu.com/search?suite=trusty&searchon=names&keywords=apache)" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [20:46:11] who is that can speak russian? [20:46:15] (03PS4) 10Chad: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371 [20:46:32] Zppix: what do you need in Russian? [20:46:56] lol, Translator? [20:46:57] mutante: t162483 the bug reporter english isnt great and is native to russian lang so i thought that could be helpful [20:47:03] (03PS3) 10Dzahn: interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 (owner: 10Faidon Liambotis) [20:47:24] Zppix: MaxSem if he is around iirc [20:47:25] T162483 [20:47:26] T162483: Missing thumbnail image on Commons - https://phabricator.wikimedia.org/T162483 [20:48:27] http://savepic.ru/13479534.png [20:48:29] I mean it's pretty clear what's happening, even if nobody can reproduce yet. [20:48:36] this is supposed to described the bug (the image above) [20:48:37] And MatmaRex pointed to the likely issue [20:48:39] * RainbowSprinkles shrugs [20:50:38] alright, i was just trying to be helpful i saw that task and saw that reporters english isnt easy for them, so i thought i say something. [20:51:47] Zppix: thanks, we'll need them to paste the error they get. 
i cant reproduce it either, just like what Andre says [20:52:12] the problem is pretty clear I think [20:52:14] mutante: no problem, I just thought them having something in their native tongue would be easier than using translate [20:52:28] (03CR) 10Dzahn: [C: 032] interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 (owner: 10Faidon Liambotis) [20:52:51] (03PS4) 10Dzahn: interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 (owner: 10Faidon Liambotis) [20:52:51] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3165018 (10Nemo_bis) >>! In T160529#3164796, @grin wrote: > It depends on the versions of the components and also on the load of the servers. https://ganglia.wikimedia.org/latest/graph_all_peri... [20:53:04] andre__: i know but it never hurts to double check. :) [20:53:49] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:55:35] Zppix: I already intentionally used simple English in my Phab comment [20:55:59] and *if* there's any confusion I'm happy to write in Russian instead. But there is none so far. [20:56:38] andre__: i figured but i thought i'd go out of my way to make sure we get any and all info we can. Its just my personaltiy anyway im done clogging up operations [20:56:51] i wonder if the nick change of FastLizard to werelizard is automated and based on full moon phase [20:57:00] :D [20:57:02] mutante: it may be an afk thing [20:57:17] might be a Herald rule :P [20:57:32] andre__: herald on irc... that would be interesting [20:58:32] (03CR) 10Zppix: "> @Zppix it should all be the same thing. it's 2.4 in all other" [puppet] - 10https://gerrit.wikimedia.org/r/345548 (owner: 10Faidon Liambotis) [21:00:02] (03CR) 10Dzahn: [C: 031] "as jenkins-bot +2 shows, there are no more occurrances of this in the repo now." 
[puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [21:00:27] (03PS3) 10Dzahn: Link to Code of Conduct from Phabricator's footer [puppet] - 10https://gerrit.wikimedia.org/r/343749 (owner: 10Aklapper) [21:01:49] (03CR) 10Dzahn: [C: 032] "disclaimer: no personal opinion on the CoC (or the approval process), just a technical change since it was moved from draft to official on" [puppet] - 10https://gerrit.wikimedia.org/r/343749 (owner: 10Aklapper) [21:03:54] (03CR) 10Zppix: [C: 031] typos: make "include profile::backup" a typo [puppet] - 10https://gerrit.wikimedia.org/r/347064 (owner: 10Dzahn) [21:04:28] Zppix: the joke there is that herald has a moon phase trigger in it ;) [21:04:49] bd808: i know i was saying having in irc would be interesting [21:06:10] (03PS2) 10Dzahn: ruthenium: increase parsoid-vd clients from 4 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [21:06:29] (03CR) 10Dzahn: "@subbu I'm being bold and just amended to make it "5" instead" [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [21:06:49] mutante, sonds good. :) [21:07:03] (03PS3) 10Dzahn: ruthenium: increase parsoid-vd clients from 4 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [21:07:03] jouncebot: next [21:07:04] In 63 hour(s) and 52 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170410T1300) [21:07:05] subbu: :) ok [21:08:22] (03CR) 10Dzahn: [C: 032] ruthenium: increase parsoid-vd clients from 4 to 5 [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [21:08:45] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3165052 (10Aklapper) >>! In T162035#3163583, @Dvorapa wrote: > For me e... 
[21:08:55] Zppix, ping finally received [21:09:32] MaxSem: sorry i should of sent you an update i think we decided to leave it be but if you want to look for yourself let me give you the link again [21:09:36] subbu: applied on ruthenium [21:09:53] MaxSem: https://phabricator.wikimedia.org/T162483 [21:10:07] subbu: i mean, i ran puppet, but i did not stop/start anything [21:14:30] mutante, thanks.. that is fine. next time we do a new visual diff run, i'll do that restart. [21:26:42] mutante: lol, this is just the nickname I use when I'm playing werewolf :P [21:28:17] subbu: ok, great [21:28:22] werelizard: :) [21:33:07] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3165182 (10Papaul) The HP guy called around 12:30pm to let me know that he was at UPS and haven't received the main board yet they told him that the truck will be there between 1 and 3 pm. The appointme... [21:33:39] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 67905.321472 Seconds [21:33:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 67911.427355 Seconds [21:34:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 67948.344441 Seconds [21:36:39] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [21:36:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [21:37:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [21:39:16] (03PS2) 10Dzahn: zookeeper: remove precise's package version [puppet] - 10https://gerrit.wikimedia.org/r/345839 (owner: 10Faidon Liambotis) [21:40:30] 06Operations, 10ops-codfw, 06DC-Ops, 10Traffic: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#3165192 (10Papaul) 05Open>03Resolved closing this task since it is the same Flash card issue in T162099 [21:41:37] 
06Operations, 10ops-ulsfo, 10fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3165199 (10Papaul) [21:42:05] paladox: whats you working on i saw your paste in phab [21:42:28] Zppix im working on trilead-ssh2 (jenkins) [21:42:44] Trying to resolve T103351 [21:42:45] T103351: Jenkins trilead-ssh2 doesn't support our MAC/KEX algorithms - https://phabricator.wikimedia.org/T103351 [21:43:09] Luckly that may be fixed next week with the release of trilead-ssh2 (new version) :) [21:44:28] fixed = fixed in [21:45:22] paladox: i take it we're fixing it earlier? [21:46:12] Thats upto releng. But im trying to fix it so releng can decide what to do. Either upgrade the ssh-slaves plugin with the update. Or to update jenkins. [21:49:32] (03CR) 10Dzahn: [C: 04-1] "removing self (i need to keep my queue down). re-add me after you amended (if you are planning to, otherwise i'd recommend to abandon)" [puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [21:50:39] paladox: i agree with mutante there ^ [21:50:56] oh [21:50:57] ok [21:51:00] (03Abandoned) 10Paladox: Phabricator: Allow us to install php7.1 for testing on labs. [puppet] - 10https://gerrit.wikimedia.org/r/343211 (owner: 10Paladox) [21:52:02] paladox: not exactly what i meant but okay :P [21:52:29] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:53:16] (03PS2) 10Dzahn: Remove programdashboard module and related hieradata [puppet] - 10https://gerrit.wikimedia.org/r/340164 (owner: 10Dduvall) [21:54:26] mutante: mind if i rebase 340164? [21:54:49] (03PS3) 10Dzahn: Remove programdashboard module and related hieradata [puppet] - 10https://gerrit.wikimedia.org/r/340164 (owner: 10Dduvall) [21:55:05] guess that answers that xD [21:55:46] Zppix: yea, it only really makes sense if you do it right before merge. 
i did not see your comment until i hit the button anyways
[21:55:58] (CR) Dzahn: [C: 2] Remove programdashboard module and related hieradata [puppet] - https://gerrit.wikimedia.org/r/340164 (owner: Dduvall)
[21:56:41] mutante: i like keeping merge conflicts down, i have bad experiences with them, and i've noticed regular rebasing can stop a mess of merge conflicts later on
[21:58:22] pardon my typos i seem to be unable to spell today
[21:58:50] Zppix: if you go to a random change in gerrit, and look for the field "Strategy" you can see it is "FastForward Only" in ops/puppet, but maybe different in other repos
[21:59:15] Zppix: because of that strategy we will always have to rebase right before merge anyways
[21:59:33] mutante: i know but still i prefer to rebase if it's a heavily committed-to file
[21:59:40] and a "merge conflict" is the "normal" state of something waiting
[21:59:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:00:37] (PS3) Dzahn: jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/342210 (owner: Hashar)
[22:00:44] mutante: it's just a pet peeve of mine
[22:01:12] mortals :P
[22:01:29] :p that used to be a real group name, historically
[22:01:37] before there was deployment and others
[22:01:47] mutante: i know
[22:03:39] (CR) Dzahn: [C: 2] jenkins: tweak log permissions [puppet] - https://gerrit.wikimedia.org/r/342210 (owner: Hashar)
[22:04:48] (PS1) Dzahn: Revert "jenkins: tweak log permissions" [puppet] - https://gerrit.wikimedia.org/r/347121
[22:04:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:05:08] (CR) Dzahn: [C: 2] "Error: Could not apply complete catalog: Found 1 dependency cycle:" [puppet] - https://gerrit.wikimedia.org/r/347121 (owner: Dzahn)
[22:05:31] (CR) Dzahn: [V: 2 C: 2] Revert "jenkins: tweak log permissions" [puppet] - https://gerrit.wikimedia.org/r/347121 (owner: Dzahn)
[22:05:41] mutante: want me to help look for the dependency?
[22:05:54] Zppix: go ahead if you want to fix it
[22:06:04] add hashar to it
[22:06:11] mutante: what exactly am i looking for?
[22:06:34] Zppix: i pasted the error above, that's all i have
[22:06:44] a dependency cycle
[22:06:45] mutante: ack, want me to add you as well?
[22:07:08] yea, once you get a +1
[22:07:53] mutante: okay, that error doesn't give me much to go on. but here we go
[22:08:13] i won't review it right now. first in, first out
[22:08:16] but thanks
[22:08:55] (PS1) Reedy: Switch EducationProgram to extension.json for extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/347122 (https://phabricator.wikimedia.org/T162481)
[22:09:01] mutante: no problem, the error would be in the puppet repo, not jenkins or integration, correct?
[22:09:30] Zppix: correct
[22:09:49] mutante: i'll make sure i kill the dependency with a flamethrower
[22:10:43] mutante: stupid question but where would i find said access log?
[22:12:05] Zppix: i don't know, check in the code, i'd do the same thing
[22:12:12] ok
[22:13:28] (CR) Dzahn: [C: 2] zookeeper: remove precise's package version [puppet] - https://gerrit.wikimedia.org/r/345839 (owner: Faidon Liambotis)
[22:13:40] weird, i just searched all of the puppet repo and found nothing with the name access.log
[22:14:18] paladox: you wouldn't know where var/log/jenkins/ would be, would you?
[22:14:33] It's /var/log/jenkins
[22:14:37] (PS3) Dzahn: zookeeper: remove precise's package version [puppet] - https://gerrit.wikimedia.org/r/345839 (owner: Faidon Liambotis)
[22:14:45] paladox: okay what repo?
[22:14:52] It's not in a repo
[22:14:58] jenkins generates the file.
[22:15:05] So you would need jenkins installed
[22:15:07] mutante: it appears i'm unable to smash this for you...
[22:15:30] that's ok.
you don't need the logfile for it though
[22:15:38] the problem is all in puppet code
[22:15:40] (PS3) Andrew Bogott: Keystonehooks: Add two more ldap ous for sudo handling. [puppet] - https://gerrit.wikimedia.org/r/346880
[22:15:42] (PS11) Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:16:01] mutante: according to the error it doesn't like something in that file
[22:16:32] (CR) jerkins-bot: [V: -1] wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: Andrew Bogott)
[22:16:45] Zppix: no, it's not about the content of the file, it's about dependencies between the logfile and the directory it is in
[22:16:59] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:17:07] mutante: could jenkins running be causing the error?
[22:17:30] jenkins needs the file to be jenkins:jenkins
[22:17:32] Zppix: no
[22:17:34] otherwise it fails
[22:18:16] how old was that patch?
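Editorial aside on the "dependency cycle" error discussed here: a minimal sketch, assuming invented resource names and edges, of how a cycle between resources such as a logfile and its directory can be detected. This is not Puppet's own implementation, and the log never shows the actual cycle; the graph below is hypothetical and only illustrates the idea.

```python
from typing import Dict, List, Optional

def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic.

    deps maps each resource to the resources it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, in progress, done
    state: Dict[str, int] = {}
    stack: List[str] = []                 # current depth-first path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        stack.append(node)
        for nxt in deps.get(node, []):
            colour = state.get(nxt, WHITE)
            if colour == GREY:
                # nxt is already on the current path: that is a cycle
                return stack[stack.index(nxt):] + [nxt]
            if colour == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        state[node] = BLACK
        return None

    for name in list(deps):
        if state.get(name, WHITE) == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None

# Hypothetical edges, made up for illustration only.
example_deps = {
    "File[/var/log/jenkins/access.log]": ["File[/var/log/jenkins]"],
    "File[/var/log/jenkins]": ["Service[jenkins]"],
    "Service[jenkins]": ["File[/var/log/jenkins/access.log]"],
}

print(find_cycle(example_deps))
```

The point matches what mutante says: the content of the logfile is irrelevant, only the ordering edges between resources matter.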
[22:18:18] "dependency cycle" is the keyword here
[22:18:26] !log reedy@tin Synchronized php-1.29.0-wmf.19/extensions/EducationProgram/EducationProgram.php: Load wgExtensionMessagesFiles in PHP entry point for mergeMessageLists T162481 (duration: 00m 49s)
[22:18:30] look up puppet dependencies between resources
[22:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:34] T162481: Unwanted change of EducationProgram namespace - https://phabricator.wikimedia.org/T162481
[22:19:05] (PS4) Dzahn: zookeeper: remove precise's package version [puppet] - https://gerrit.wikimedia.org/r/345839 (owner: Faidon Liambotis)
[22:20:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=571 [critical =500]
[22:20:29] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[22:23:19] (CR) Dzahn: "no changes on conf1001/2001 or druid1001" [puppet] - https://gerrit.wikimedia.org/r/345839 (owner: Faidon Liambotis)
[22:23:30] ACKNOWLEDGEMENT - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=571 [critical =500] Jeff_Green still cleaning up queue consumers post migration
[22:23:31] ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=824 [critical =325] Jeff_Green still cleaning up queue consumers post migration
[22:25:47] (PS12) Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091)
[22:28:00] (CR) Andrew Bogott: [C: -1] wmfkeystonehooks: Create project page on wikitech on project creation (2 comments) [puppet] - https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: Andrew Bogott)
[22:32:17] (CR) Dzahn: "compiler fail due to unrelated issue?: Detail: Unable to read data from
conftool , Error: Failed to parse template scap/dsh/dsh-group.e" [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[22:33:21] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6048/etherpad1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/346921 (owner: Dzahn)
[22:33:59] (CR) Dzahn: [C: -1] "Error: Could not find data item mailman::lists_servername in any Hiera data file http://puppet-compiler.wmflabs.org/6049/" [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn)
[22:34:19] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/6050/" [puppet] - https://gerrit.wikimedia.org/r/346926 (owner: Dzahn)
[22:35:18] (CR) Dzahn: "no change on iron. fail on bast1001 due to unrelated issue. bastion hosts always have " Detail: Unable to read data from conftool" in com" [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[22:39:55] (PS2) Reedy: Switch EducationProgram to extension.json for extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/347122 (https://phabricator.wikimedia.org/T162481)
[22:40:09] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0
[22:40:23] (PS4) Dzahn: etherpad: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/346921
[22:43:48] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/6061/" [puppet] - https://gerrit.wikimedia.org/r/346921 (owner: Dzahn)
[22:44:59] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[22:45:36] (PS4) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923
[22:51:29] (PS5) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923
[22:51:49] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:51:59] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:53:39] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:53:49] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[22:56:06] (PS6) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923
[23:04:45] (PS7) Dzahn: lists: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/346923
[23:05:03] (CR) Thcipriani: "inline question" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347057 (owner: Chad)
[23:06:12] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/6066/fermium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn)
[23:07:03] (PS2) Dzahn: delete netmon::migration class, not needed anymore [puppet] - https://gerrit.wikimedia.org/r/346924
[23:11:12] (CR) Dzahn: [C: 2] delete netmon::migration class, not needed anymore [puppet] - https://gerrit.wikimedia.org/r/346924 (owner: Dzahn)
[23:16:40] !log gerrit2001 - deleting netmon1001 backup (/srv/netmon1001), stop rsyncd, remove rsyncd config (T125020)
[23:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:46] T125020: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020
[23:17:51] (CR) Dzahn: "16:19 < mutante> !log gerrit2001 - deleting netmon1001 backup (/srv/netmon1001), stop rsyncd, remove rsyncd config (T125020)" [puppet] - https://gerrit.wikimedia.org/r/346924 (owner: Dzahn)
[23:18:33] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[23:18:51] Operations, Release-Engineering-Team: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165440 (Paladox)
[23:19:09] Operations,
Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Paladox)
[23:19:13] Operations, Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165442 (Zppix) Instead of constant workarounds, why don't we fix it?
[23:19:19] (CR) jerkins-bot: [V: -1] typos: add (requires_os|os_version)\([^)]*[[:upper:]] [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[23:20:07] Operations, Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165443 (Paladox) It would be good to fix that problem.
[23:20:30] (CR) Dzahn: "there are some more of this in module mediawiki (detected by https://gerrit.wikimedia.org/r/#/c/346677/1)" [puppet] - https://gerrit.wikimedia.org/r/345561 (owner: Faidon Liambotis)
[23:22:11] mutante: i will do a follow up to ^
[23:22:24] Operations, Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165426 (Dzahn) >>! In T162505#3165438, @Paladox wrote: > Varnish looks like the problem. I don't think so. Varnish says "Backend fetch failed". The Backend is integration.wm.org.
[23:23:06] Zppix: ok
[23:23:45] (CR) Dzahn: "some of these are left in mediawiki module:" [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[23:26:54] (CR) Dzahn: "blocked by switch port config removal" [dns] - https://gerrit.wikimedia.org/r/344651 (owner: Papaul)
[23:28:38] (Draft2) Zppix: Standardize on lowercase os_version/require_os (Part 2) [puppet] - https://gerrit.wikimedia.org/r/347134
[23:29:30] Operations, Release-Engineering-Team, Jenkins: Jenkins Web UI error - https://phabricator.wikimedia.org/T162505#3165452 (Paladox) Oh
[23:29:49] mutante: ^^
[23:30:32] (CR) Thcipriani: Scap clean: Log to IRC when we prune a branch (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347061 (owner: Chad)
[23:30:59] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:39] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:34:59] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[23:35:00] Operations, MediaWiki-Configuration, MediaWiki-Platform-Team, Performance-Team, and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3165454 (Legoktm) >>! In T156924#3162419, @tstarling wrote: >>>! In T156924#3138751, @tsta...
[23:36:03] (CR) Dzahn: "thanks for this Zppix, but i realized these are being removed completely in https://gerrit.wikimedia.org/r/#/c/345546/" [puppet] - https://gerrit.wikimedia.org/r/347134 (owner: Zppix)
[23:36:36] oh well it gave me something to do :P
[23:36:48] (Abandoned) Zppix: Standardize on lowercase os_version/require_os (Part 2) [puppet] - https://gerrit.wikimedia.org/r/347134 (owner: Zppix)
[23:37:11] Zppix: i should have told you/seen that earlier
[23:37:22] mutante: meh i was bored anyway :P
[23:37:27] that other MW change is one of the few ones not merged yet
[23:37:43] ok. well, i think it's time for weekend now
[23:38:12] cu later Zppix
[23:38:16] have a good one
[23:38:37] you too! laters
[23:40:32] (PS2) Chad: Scap clean: Log to IRC when we prune a branch [mediawiki-config] - https://gerrit.wikimedia.org/r/347061
[23:40:34] (CR) Dzahn: "these will be removed by https://gerrit.wikimedia.org/r/#/c/345546/" [puppet] - https://gerrit.wikimedia.org/r/346677 (owner: Dzahn)
[23:40:39] (CR) Chad: Scap clean: Log to IRC when we prune a branch (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347061 (owner: Chad)
[23:45:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=739 [critical =500]
[23:49:07] (CR) Thcipriani: [C: 1] Scap clean: Log to IRC when we prune a branch [mediawiki-config] - https://gerrit.wikimedia.org/r/347061 (owner: Chad)
[23:50:09] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=1156 [critical =500]
[23:56:47] Operations, Office-IT, LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3165498 (bbogaert) > The problem is somewhere in "Google Cloud Directory Sync", then. It appears as if moving a user to a different OU isn't reflected in the LDAP da...