[00:01:50] (CR) Tim Starling: "Needs testing, maybe we can deploy to beta cluster first? Please review for performance, is using a MultiConfig in production problematic?" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[00:09:19] (CR) Dzahn: [C: 1] "this is probably totally fine, but since it's mediawiki-config repo and not puppet, please add it to a deployment swat" [mediawiki-config] - https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) (owner: Kaldari)
[00:11:21] (CR) Dzahn: "it should be a gerrit feature to set a date to be reminded or to auto-add a +1 once the set date has passed :)" [puppet] - https://gerrit.wikimedia.org/r/345838 (owner: Faidon Liambotis)
[00:15:53] Operations, MediaWiki-Cache, Performance-Team, Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#3170181 (Krinkle)
[00:18:02] Operations, MediaWiki-Cache, MediaWiki-JobQueue, Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3170197 (Krinkle)
[00:36:34] (PS2) Tim Starling: Use EtcdConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924)
[00:41:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:46:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[00:56:11] Operations, ops-codfw, DBA, Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3170256 (Papaul)
[01:14:53] Operations, MediaWiki-Configuration, MediaWiki-Platform-Team, Performance-Team, and 9 others: Allow integration 
of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3170263 (tstarling) >>! In T156924#3165454, @Legoktm wrote: > `$wmg` is slowly being phase...
[01:16:43] (PS1) Tim Starling: Rename all WMF-specific configuration variables to have a wgWMF prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/347541
[01:17:29] (CR) jerkins-bot: [V: -1] Rename all WMF-specific configuration variables to have a wgWMF prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/347541 (owner: Tim Starling)
[01:48:16] (PS2) Tim Starling: Rename all WMF-specific configuration variables to have a wgWMF prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/347541
[02:06:53] !log jessie-recdns: unpausing upgrade process...
[02:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:18] (CR) Krinkle: "I imagine we'd want EtcdConfig to know which keys it is supposed to serve so that MultiConfig doesn't query Etcd for every has() - which w" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:08:31] (Draft2) TTO: Enable user group expiry in production [mediawiki-config] - https://gerrit.wikimedia.org/r/347545 (https://phabricator.wikimedia.org/T159416)
[02:08:42] (CR) TTO: [C: -1] "Waiting for https://gerrit.wikimedia.org/r/341947/ to be merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/347545 (https://phabricator.wikimedia.org/T159416) (owner: TTO)
[02:10:33] (PS1) BBlack: esams lvs: do not directly use nescio, temporarily [puppet] - https://gerrit.wikimedia.org/r/347546
[02:10:46] (CR) BBlack: [V: 2 C: 2] esams lvs: do not directly use nescio, temporarily [puppet] - https://gerrit.wikimedia.org/r/347546 (owner: BBlack)
[02:10:52] (CR) Krinkle: "Aye, I see it stores them all in one query and then process cache. 
That should be fine, but worth confirming indeed. Having it run in beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:13:09] !log upgrading nescio to pdns-recursor 4.x
[02:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:07] (PS1) BBlack: Revert "esams lvs: do not directly use nescio, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347547
[02:15:13] (CR) BBlack: [V: 2 C: 2] Revert "esams lvs: do not directly use nescio, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347547 (owner: BBlack)
[02:16:32] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: service=pdns_recursor,name=nescio.wikimedia.org
[02:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:12] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 07m 47s)
[02:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:57] (PS1) BBlack: eqiad lvs: do not directly use chromium, temporarily [puppet] - https://gerrit.wikimedia.org/r/347548
[02:29:16] (CR) BBlack: [V: 2 C: 2] eqiad lvs: do not directly use chromium, temporarily [puppet] - https://gerrit.wikimedia.org/r/347548 (owner: BBlack)
[02:31:52] !log bblack@neodymium conftool action : set/pooled=no; selector: name=chromium.wikimedia.org,service=pdns_recursor
[02:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:52] !log upgrading chromium to pdns-recursor 4.x
[02:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:28] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:28] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:28] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:28] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:28] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:29] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:29] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:30] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:30] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:31] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:31] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:32] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:32] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:33] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:48] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:48] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:48] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:48] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:48] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:53] well, that's clearly a pattern then, and recdns induced it last time as well
[02:35:21] note that last time around, it was before attempting the upgrade-restart, just after depooling from pybal
[02:36:04] clearly there's something wrong here in how these services depend on recdns that breaks the abstraction of the recdns LVS service and depooling...
[02:36:18] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[02:36:18] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[02:36:28] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[02:36:29] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[02:36:29] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[02:36:30] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on 
restbase-dev1001 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[02:36:38] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[02:36:39] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[02:36:39] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[02:36:40] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[02:36:40] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[02:36:41] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[02:36:59] perhaps they're maintaining a TCP DNS connection, and it RSTs on server depool, and this makes the service flap due to some reconnect timeout before resolution works again? it has to be some crazy scenario like that.
[02:37:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:37:28] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[02:37:28] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[02:37:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[02:37:38] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[02:37:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[02:37:52] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=chromium.wikimedia.org,service=pdns_recursor
[02:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[02:38:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[02:39:16] (PS1) BBlack: Revert "eqiad lvs: do not directly use chromium, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347551
[02:39:28] (CR) BBlack: [V: 2 C: 2] Revert "eqiad lvs: do not directly use chromium, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347551 (owner: BBlack)
[02:43:13] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.19) (duration: 07m 16s)
[02:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:47:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold 
[1000.0]
[02:48:48] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 11 02:48:56 UTC 2017 (duration 5m 43s)
[02:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:49:21] (PS1) BBlack: esams lvs: do not directly use maerlant, temporarily [puppet] - https://gerrit.wikimedia.org/r/347552
[02:49:38] (CR) BBlack: [V: 2 C: 2] esams lvs: do not directly use maerlant, temporarily [puppet] - https://gerrit.wikimedia.org/r/347552 (owner: BBlack)
[02:50:41] !log bblack@neodymium conftool action : set/pooled=no; selector: name=maerlant.wikimedia.org,service=pdns_recursor
[02:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:15] !log upgrading maerlant to pdns-recursor 4.x
[02:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:33] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=maerlant.wikimedia.org,service=pdns_recursor
[02:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:59] (PS1) BBlack: Revert "esams lvs: do not directly use maerlant, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347553
[02:53:06] (CR) BBlack: [V: 2 C: 2] Revert "esams lvs: do not directly use maerlant, temporarily" [puppet] - https://gerrit.wikimedia.org/r/347553 (owner: BBlack)
[02:55:01] (PS1) BBlack: Revert "dnsrecursor: update to backports for transition" [puppet] - https://gerrit.wikimedia.org/r/347554
[02:55:06] (PS2) BBlack: Revert "dnsrecursor: update to backports for transition" [puppet] - https://gerrit.wikimedia.org/r/347554
[02:55:13] (CR) BBlack: [V: 2 C: 2] Revert "dnsrecursor: update to backports for 
transition" [puppet] - https://gerrit.wikimedia.org/r/347554 (owner: BBlack)
[02:59:13] !log jessie recdns software upgrades complete
[02:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:18] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:03:18] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:28] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:05:57] Operations, netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#2913735 (faidon) This is great to see and a very good catch. Nice work @ayounsi!
[03:06:14] bblack: \o/
[03:08:52] euh what happened?
[03:10:09] mobrovac: I did some recdns maintenance that should've been impact-free, but various RB/scb services blipped in monitoring. background: our normal recdns (from machines' resolv.conf)
[03:10:27] oh ok ok
[03:10:30] cool
[03:10:31] ... 
goes to a resolver address on LVS, which distributes the DNS queries out to a pair of real recdns
[03:10:41] thnx for the info bblack
[03:10:46] so we can depool one or the other for maintenance without impact
[03:10:52] so the .discovery.wmnet stuff basically
[03:11:11] but for some reason, the act of initially depooling one of the recdns pair caused RB/scb stuff to freak out for a short period
[03:11:37] there's definitely a real problem there, but I'm not sure what it is or at what layer
[03:11:44] no other services freak out, and we don't expect any to
[03:11:51] kk we should probably flag it for _joe_
[03:11:58] yeah I'll go over it with him tomorrow
[03:12:17] cool thnx
[03:31:18] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[03:58:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:02:11] (PS8) Mobrovac: RESTBase: Switch to Scap3 config deploys [puppet] - https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335)
[04:11:58] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=682.20 Read Requests/Sec=333.50 Write Requests/Sec=497.40 KBytes Read/Sec=39853.20 KBytes_Written/Sec=3633.20
[04:13:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:22:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=60.70 Read Requests/Sec=4.00 Write Requests/Sec=3.50 KBytes Read/Sec=16.40 KBytes_Written/Sec=169.20
[05:29:51] (PS1) Smalyshev: Enable "trailing poller" functionality for production. 
[puppet] - https://gerrit.wikimedia.org/r/347565 (https://phabricator.wikimedia.org/T161342)
[05:49:11] (CR) Giuseppe Lavagetto: "If we want this to work in beta, we need first to merge my latest confctl changes, build a package and then populate the beta etcd instanc" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[05:56:54] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/347567
[05:56:57] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/347567
[05:57:37] (CR) Mobrovac: [C: -1] "Not ready to go yet" [puppet] - https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) (owner: Mobrovac)
[05:58:57] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/347567 (owner: Marostegui)
[06:00:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/347567 (owner: Marostegui)
[06:00:20] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - https://gerrit.wikimedia.org/r/347567 (owner: Marostegui)
[06:01:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T132416 (duration: 00m 39s)
[06:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:14] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[06:01:57] (CR) Giuseppe Lavagetto: [C: 2] Fix output typo in the redis task [switchdc] - https://gerrit.wikimedia.org/r/347420 (owner: Giuseppe Lavagetto)
[06:02:15] !log Deploy schema change labsdb1003 (s7) - T160390
[06:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:22] T160390: Unify revision 
table on s7 - https://phabricator.wikimedia.org/T160390
[06:03:47] (CR) Giuseppe Lavagetto: [C: -2] "Profile::mediawiki::jobrunner selects both "normal" jobrunners and videoscalers." [switchdc] - https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[06:07:28] !log Deploy schema change on db1041 (eqiad master) (s7) - T160390
[06:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:35] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[06:08:46] (CR) Giuseppe Lavagetto: [C: 2] Logging: filter out all cumin's messages from stderr [switchdc] - https://gerrit.wikimedia.org/r/347534 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[06:08:51] (PS2) Giuseppe Lavagetto: Logging: filter out all cumin's messages from stderr [switchdc] - https://gerrit.wikimedia.org/r/347534 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[06:08:57] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Logging: filter out all cumin's messages from stderr [switchdc] - https://gerrit.wikimedia.org/r/347534 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[06:11:49] (CR) Giuseppe Lavagetto: "> It looks more a workaround than a fix, probably at this point it" [switchdc] - https://gerrit.wikimedia.org/r/347423 (owner: Giuseppe Lavagetto)
[06:20:28] Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3170419 (Marostegui) >>! In T149006#3168914, @Gehel wrote: > > > @Marostegui how did you diagnose the CPU issue? > > At some point we cha...
[06:21:12] (PS2) Giuseppe Lavagetto: Correct offset of the main task to be 0 [switchdc] - https://gerrit.wikimedia.org/r/347423
[06:21:14] (PS1) Giuseppe Lavagetto: Clarify documentation for puppet disable job [switchdc] - https://gerrit.wikimedia.org/r/347568
[06:22:55] (PS1) Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/347569 (https://phabricator.wikimedia.org/T132416)
[06:24:32] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/347569 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[06:25:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/347569 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[06:26:31] (CR) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/347569 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[06:26:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T132416 (duration: 00m 43s)
[06:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:53] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[06:28:42] !log Deploy alter table enwiki.revision db1072 - T132416
[06:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:32] !log restart hhvm on mw1299 - dump debug in /tmp/hhvm.84379.bt
[06:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:28] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time
[06:42:28] PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[06:42:58] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on 
restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused
[06:44:08] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:44:58] PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:48:32] !log installing jasper security updates
[06:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:04] !log Deploy alter table enwiki.revision dbstore1002 - T132416
[06:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:11] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[06:52:47] when any Italian speaker is around... :)
[06:56:58] RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active
[06:57:08] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational
[06:58:11] !log restarted cassandra-a on restbase2004, crashed with "out of heap memory"
[06:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:38] RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2017-09-12 15:35:23 +0000 (expires in 154 days)
[06:58:58] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.137 port 9042
[07:55:23] (CR) Volans: [C: 2] Clarify documentation for puppet disable job [switchdc] - https://gerrit.wikimedia.org/r/347568 (owner: Giuseppe Lavagetto)
[07:55:42] (Abandoned) Volans: Disable puppet: add videoscalers [switchdc] - https://gerrit.wikimedia.org/r/347511 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[07:56:43] (CR) Volans: [C: 2] Correct offset of the main task to be 0 [switchdc] - 
https://gerrit.wikimedia.org/r/347423 (owner: Giuseppe Lavagetto)
[07:57:01] (CR) jerkins-bot: [V: -1] Clarify documentation for puppet disable job [switchdc] - https://gerrit.wikimedia.org/r/347568 (owner: Giuseppe Lavagetto)
[07:57:15] Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#3170604 (MoritzMuehlenhoff) This still happens and we have GTK/Gnome base libs installed on > 1000 servers now, we should really sort this out for stretch...
[07:58:58] (CR) Volans: [C: 2] "recheck" [switchdc] - https://gerrit.wikimedia.org/r/347568 (owner: Giuseppe Lavagetto)
[08:11:45] (CR) Alexandros Kosiaris: [C: -1] "a final few inline comments, the rest LGTM" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/346923 (owner: Dzahn)
[08:31:13] Operations, Cassandra, Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3170616 (elukey) @Eevans thanks a lot for the details, I had no idea that these manual steps should have been done (I thought that partman would have created e...
[08:37:08] (PS1) Giuseppe Lavagetto: Properly handle inserting menu items with an explicit index [switchdc] - https://gerrit.wikimedia.org/r/347573
[08:53:50] (CR) Alexandros Kosiaris: [C: -1] ircecho: Convert to base::service class to maintain the script (8 comments) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[09:09:50] Operations, netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3170689 (fgiunchedi) Indeed, thanks a lot @ayounsi for fixing this long-standing issue!
[09:11:31] !log upgrade swift to 2.2.0 on ms-be2001 - T162609
[09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:38] T162609: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609
[09:15:13] (PS2) Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - https://gerrit.wikimedia.org/r/347388
[09:17:44] !log install remaining pam updates from jessie point update
[09:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:36] (PS1) Elukey: Correct some typos for analytics10[64,68] [dns] - https://gerrit.wikimedia.org/r/347577 (https://phabricator.wikimedia.org/T162216)
[09:20:59] volans: mind to sanity check? --^
[09:21:04] sure
[09:21:35] godog: swift upgrades \o/
[09:22:18] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3170703 (fgiunchedi)
[09:22:26] (CR) Volans: [C: 1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/347577 (https://phabricator.wikimedia.org/T162216) (owner: Elukey)
[09:23:09] thanks volans!
[09:23:22] yw :)
[09:23:56] (PS3) Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - https://gerrit.wikimedia.org/r/347388
[09:24:41] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3170704 (fgiunchedi)
[09:24:47] ema: aye, baby steps :))
[09:25:21] Operations, media-storage: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (fgiunchedi) >>! In T162609#3168565, @faidon wrote: > The plan above totally makes sense to me and sounds like the path of the least amount of work with the maximum amount of consistency. > >...
[09:25:31] (CR) Elukey: [C: 2] Correct some typos for analytics10[64,68] [dns] - https://gerrit.wikimedia.org/r/347577 (https://phabricator.wikimedia.org/T162216) (owner: Elukey)
[09:26:22] (Abandoned) Gehel: maps - increase number of retries before alert for posttgresql lag check [puppet] - https://gerrit.wikimedia.org/r/346710 (https://phabricator.wikimedia.org/T162345) (owner: Gehel)
[09:27:49] Operations, Multimedia, media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3170708 (Pokefan95) p:Unbreak!>High It seems this is no longer affecting our files, so I lowered the priority....
[09:28:18] PROBLEM - DPKG on bast2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:28:18] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:29:37] ^ these are related to puppet's installation status
[09:29:54] puppet and puppet-common are in "iU" state
[09:31:18] RECOVERY - DPKG on bast3002 is OK: All packages OK
[09:34:18] RECOVERY - DPKG on bast2001 is OK: All packages OK
[09:38:38] (PS4) Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - https://gerrit.wikimedia.org/r/347388
[09:40:06] akosiaris: sorry that patch will take a while to process by CI :(
[09:40:12] there is some deadlock in Jenkins I am investigating
[09:41:10] hashar: no worries.. not in a rush
[09:54:10] Operations, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (Rameshti) >>! In T161529#3166608, @DatGuy wrote: > Blocked for logo. Waiting to hear what "The Free Encyclopedia" is in Doteli. 
एक खुल्ला ज्ञान भँणार
[10:10:18] PROBLEM - Disk space on prometheus2002 is CRITICAL: DISK CRITICAL - free space: /srv 609 MB (0% inode=82%)
[10:10:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:12:44] (PS5) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[10:12:48] (CR) Paladox: ircecho: Convert to base::service class to maintain the script (8 comments) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[10:12:53] (PS6) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[10:13:03] (CR) Paladox: "Tested locally and works." [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[10:13:03] !log upgrading wtp1010-wtp1019 to Linux 4.9
[10:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:15:48] (PS1) Volans: Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178)
[10:18:44] (CR) jerkins-bot: [V: -1] Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:19:27] (PS2) Volans: Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178)
[10:23:07] (PS1) Addshore: Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370)
[10:30:27] (PS1) Addshore: 
Enable alternate RevSlider slider on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410)
[10:31:47] (PS1) Addshore: Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584
[10:32:25] (PS2) Addshore: Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584
[10:33:10] (PS1) Addshore: Fix spaces to tabs in labs wgRevisionSliderAlternateSlider array [mediawiki-config] - https://gerrit.wikimedia.org/r/347585
[10:33:27] Operations, Multimedia, media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3170792 (fgiunchedi) I think all disappearing files should be back now as rebalance has finished. We are working on bri...
[10:43:08] Operations, media-storage, User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3170801 (fgiunchedi)
[10:46:19] !log upgrading wtp1020-wtp1024 to Linux 4.9
[10:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:09] PROBLEM - swift-object-auditor on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:09] PROBLEM - swift-container-updater on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:09] PROBLEM - dhclient process on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:09] PROBLEM - swift-object-server on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:18] PROBLEM - swift-account-server on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:53:59] RECOVERY - swift-object-auditor on ms-be2002 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[10:53:59] RECOVERY - dhclient process on ms-be2002 is OK: PROCS OK: 0 processes with command name dhclient
[10:53:59] RECOVERY - swift-object-server on ms-be2002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:53:59] RECOVERY - swift-container-updater on ms-be2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:54:08] RECOVERY - swift-account-server on ms-be2002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:59:58] (CR) Addshore: [C: 2] Fix spaces to tabs in labs wgRevisionSliderAlternateSlider array [mediawiki-config] - https://gerrit.wikimedia.org/r/347585 (owner: Addshore)
[11:01:04] (Merged) jenkins-bot: Fix spaces to tabs in labs wgRevisionSliderAlternateSlider array [mediawiki-config] - https://gerrit.wikimedia.org/r/347585 (owner: Addshore)
[11:01:13] (CR) jenkins-bot: Fix spaces to tabs in labs wgRevisionSliderAlternateSlider array [mediawiki-config] - https://gerrit.wikimedia.org/r/347585 (owner: Addshore)
[11:01:51] (PS2) Giuseppe Lavagetto: Properly handle inserting menu items with an explicit index [switchdc] - https://gerrit.wikimedia.org/r/347573
[11:02:18] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: NOOP (Beta file only) - Fix some tabs (duration: 00m 39s)
[11:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:14] PROBLEM - Hadoop DataNode on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[11:05:27] new worker --^
[11:05:28] (PS3) Addshore: Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584
[11:06:12]
(PS1) Volans: Mediawiki: explicitly use UTC for the date to print [switchdc] - https://gerrit.wikimedia.org/r/347588 (https://phabricator.wikimedia.org/T160178)
[11:06:14] RECOVERY - Hadoop DataNode on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[11:06:23] (CR) Addshore: [C: 2] Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584 (owner: Addshore)
[11:07:15] PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:07:26] (Merged) jenkins-bot: Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584 (owner: Addshore)
[11:07:35] (CR) jenkins-bot: Remove redundant testwiki from wmgUseLinter (already has group0) [mediawiki-config] - https://gerrit.wikimedia.org/r/347584 (owner: Addshore)
[11:09:58] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: NOOP - [[gerrit:347584|Remove redundant testwiki from wmgUseLinter (already has group0)]] (duration: 00m 39s)
[11:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:14] RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational
[11:14:30] Operations, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (Janak_bhatta) Project name: विकिपिडिया Project namespace: विकिपिडिया Project talk namespace: विकिपिडिया_कुरडि
[11:17:33] (CR) WMDE-Fisch: [C: 1] Enable alternate RevSlider slider on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410) (owner: Addshore)
[11:18:14] PROBLEM - Hadoop DataNode on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java,
args org.apache.hadoop.hdfs.server.datanode.DataNode
[11:18:16] PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:19:07] elukey: ^^^
[11:19:11] yep yep it is me
[11:19:18] at least it's up :D
[11:19:20] I scheduled little downtime and it expired
[11:19:26] yeah this is my fault :)
[11:20:24] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:20:30] (PS1) Giuseppe Lavagetto: Allow suppressing SAN warnings from urllib3 [software/conftool] - https://gerrit.wikimedia.org/r/347591
[11:21:30] (PS1) Addshore: Remove redundant wmgUseRevisionSlider in InitilizeSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592
[11:21:39] Operations, Traffic, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3170871 (fgiunchedi) As far as swift upstream is concerned this issue was raised before in https://review.openstack.org/#/c/150149/ but...
[11:22:09] (CR) Addshore: [C: 2] Remove redundant wmgUseRevisionSlider in InitilizeSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592 (owner: Addshore)
[11:22:51] (PS2) Addshore: Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592
[11:23:38] (CR) Addshore: Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592 (owner: Addshore)
[11:23:41] (CR) Addshore: [C: 2] Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592 (owner: Addshore)
[11:24:56] (Merged) jenkins-bot: Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592 (owner: Addshore)
[11:26:33] (CR) jenkins-bot: Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/347592 (owner: Addshore)
[11:27:14] RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational
[11:27:20] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: NOOP (Beta file only) - [[gerrit:347592|Remove redundant wmgUseRevisionSlider in InitialiseSettings-labs]] (duration: 00m 38s)
[11:27:24] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:33] !log wtp2* to Linux 4.9
[11:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:14] RECOVERY - Hadoop DataNode on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[11:28:55] (PS1) Giuseppe Lavagetto: Suppress the SAN warnings for confctl
[switchdc] - https://gerrit.wikimedia.org/r/347593
[11:32:52] Operations, ops-eqiad, Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3170918 (elukey) analytics1064 and 1068 should be up and running now!
[11:33:49] !log resume reboot of analytics1040->1050 for kernel upgrades
[11:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:34] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:35:44] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:36:44] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:37:24] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[11:42:45] !log upgrade cache_misc to linux 4.9 T162029
[11:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:52] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029
[11:43:10] (PS3) Volans: Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178)
[11:45:22] (PS4) Volans: Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178)
[11:45:26] Operations, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3170938 (Rameshti) Project name: विकिपीडिया Project namespace: विकिपीडिया Project talk namespace: विकिपीडिया_कुरडी
[11:46:35] (PS5) Volans: Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178)
[11:47:13] (CR) Giuseppe Lavagetto: [C: 1]
Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[11:47:59] (CR) Giuseppe Lavagetto: [C: 1] "Not really needed as our clocks are in utc, but doesn't hurt if someone has weird locale settings I guess." [switchdc] - https://gerrit.wikimedia.org/r/347588 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[11:48:52] (CR) Volans: [C: -1] "minor comment" (1 comment) [switchdc] - https://gerrit.wikimedia.org/r/347593 (owner: Giuseppe Lavagetto)
[11:50:44] (PS2) Giuseppe Lavagetto: Suppress the SAN warnings for confctl [switchdc] - https://gerrit.wikimedia.org/r/347593
[11:51:15] (CR) Volans: [C: -1] "minor comment" (2 comments) [software/conftool] - https://gerrit.wikimedia.org/r/347591 (owner: Giuseppe Lavagetto)
[11:52:15] Operations, ops-esams, netops: cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3170959 (ayounsi) return part UPS tracking#: 1Z81648Y9142072038
[11:54:24] (CR) Volans: [C: 2] Mediawiki: return the right value when checking config [switchdc] - https://gerrit.wikimedia.org/r/347580 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[12:00:54] (CR) Volans: [C: -1] "Minor comments, looks good otherwise" (2 comments) [switchdc] - https://gerrit.wikimedia.org/r/347573 (owner: Giuseppe Lavagetto)
[12:19:39] Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2030914 (faidon) Ugh! In d-i-test: ``` faidon@d-i-test:~$ dpkg -l |egrep 'gnupg|mutt|pinentry' ii gnupg 2.1.18-6 amd64 GNU privacy guard - a fr...
[12:19:51] (PS2) Volans: Mediawiki: explicitly use UTC for the date to print [switchdc] - https://gerrit.wikimedia.org/r/347588 (https://phabricator.wikimedia.org/T160178)
[12:25:40] Operations, Analytics, Analytics-Cluster, hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3171078 (faidon) @Ottomata @RobH this seems to have been stalled somewhere between you two. Could you guys figure this and and T159839 out this week? Thanks!
[12:26:47] Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#3171081 (MoritzMuehlenhoff) Ack for jessie, I'll have a look at the rdepends of the various packages, we should be able to trim these via puppet
[12:28:33] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (Marostegui)
[12:32:34] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:37:34] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:37:43] Operations, Traffic: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3171126 (BBlack)
[12:37:59] Operations, Traffic: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#3171129 (BBlack)
[12:39:10] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (BBlack)
[12:39:46] Operations, Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (BBlack)
[12:40:21] Operations, Traffic: Network hardware configuration for Asia Cache DC -
https://phabricator.wikimedia.org/T162684#3171160 (BBlack)
[12:40:25] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171161 (BBlack)
[12:40:27] Operations, Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3171162 (BBlack)
[12:40:29] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3171159 (BBlack)
[12:41:38] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (BBlack)
[12:41:40] (PS12) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[12:41:40] Operations, Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (BBlack)
[12:41:53] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (BBlack)
[12:41:55] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (BBlack)
[12:42:17] Operations, Traffic: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3171168 (BBlack)
[12:42:19] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (BBlack)
[12:42:45] jouncebot: next
[12:42:45] In 0 hour(s) and 17 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1300)
[12:42:50] Operations, Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3171174 (BBlack)
[12:42:52] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3171173
(BBlack)
[12:42:54] Operations, Traffic: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#2962033 (BBlack)
[12:42:56] Operations, Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3171176 (BBlack)
[12:43:17] Operations, Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3171177 (BBlack)
[12:43:19] Operations, Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#2962020 (BBlack)
[12:44:22] Operations, Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#2962007 (BBlack)
[12:44:23] Operations, Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2968867 (BBlack)
[12:44:26] Operations, Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171179 (BBlack)
[12:45:26] (PS1) Elukey: Set Trusty for mw2246 default PXE OS [puppet] - https://gerrit.wikimedia.org/r/347597
[12:45:55] Operations, Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3171146 (BBlack) p:Triage>Normal
[12:46:18] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3171131 (BBlack) p:Triage>Normal
[12:46:33] (CR) Elukey: [C: 2] Set Trusty for mw2246 default PXE OS [puppet] - https://gerrit.wikimedia.org/r/347597 (owner: Elukey)
[12:46:35] !log Deploy schema change on db1069 (s7 instance) - T160390
[12:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:43] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390
[12:47:51] !log reimage mw2246 (Debian codfw videoscaler) to Trusty
[12:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:13] I know I am a bad person, +1
to the Trusty counter
[12:49:12] (CR) Volans: [C: 2] Mediawiki: explicitly use UTC for the date to print [switchdc] - https://gerrit.wikimedia.org/r/347588 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[12:49:55] jynus: assgning the task to the user of T162677 as well?
[12:49:56] T162677: s52584 is taking over half of the available connections on toolsdb - https://phabricator.wikimedia.org/T162677
[12:50:36] jouncebot: next
[12:50:36] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1300)
[12:52:43] hashar: did you take a look at the patches for eu swat today?
[12:53:00] I can do the deploy (unless you want to), but it would be great if you could take a look
[12:53:50] Sagan, not sure what you mean?
[12:53:51] (PS4) Zfilipin: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) (owner: Urbanecm)
[12:53:56] zeljkof: not yet
[12:53:59] (PS2) Zfilipin: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[12:54:02] (PS2) Zfilipin: Add autopatrolled group to svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: Urbanecm)
[12:54:06] (PS6) Zfilipin: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: Catrope)
[12:54:37] (CR) Hashar: [C: 1] Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) (owner: Urbanecm)
[12:54:57] (CR) Hashar: [C: 1] Increase default image
thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[12:55:15] (CR) Alexandros Kosiaris: [C: -1] ircecho: Convert to base::service class to maintain the script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[12:55:18] (CR) Hashar: [C: 1] Add autopatrolled group to svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: Urbanecm)
[12:55:19] !log powercycling wtp2013, stuck during reboot
[12:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:25] (PS13) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[12:57:00] (PS7) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[12:57:03] (CR) Paladox: ircecho: Convert to base::service class to maintain the script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[12:57:08] (PS8) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[12:57:15] (PS24) BBlack: [POC] DNS zones to puppet repo [puppet] - https://gerrit.wikimedia.org/r/342887
[12:57:48] (PS9) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[12:59:03] (CR) Hashar: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: Catrope)
[12:59:28] zeljkof: they look good, with the delta of RCFilters beta feature which might need clarification.
But RoanKattouw would know for sure
[12:59:39] hashar: thanks
[12:59:42] zeljkof: anyway it can be deployed anyway. That looks good
[12:59:54] hashar: should I just merge them all? or one by one?
[13:00:01] up to you :}
[13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1300). Please do the needful.
[13:00:04] Urbanecm and stephanebisson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:10] o/
[13:00:11] I'm here
[13:00:15] hello
[13:00:19] I am upgrading JJB version meanwhile
[13:00:20] hashar: in that case, one by one :)
[13:00:34] stephanebisson: want to deploy your own change?
[13:00:38] or should I?
[13:00:40] zeljkof, hashar: it's intentional since frwiki doesn't have ORES
[13:00:48] Operations, ops-codfw, Performance-Team, User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3171220 (fgiunchedi)
[13:00:49] zeljkof: please go ahead
[13:00:58] stephanebisson: ok
[13:01:15] zeljkof: there's also a maintenance script to be run on 3 wikis
[13:01:41] !log roll-upgrade swift to 2.2.0 across codfw machines - T162609
[13:01:42] Urbanecm, stephanebisson: can your patches be tested at mwdebug1002? (once I push them there)
[13:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:48] T162609: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609
[13:01:57] zeljkof: I think so
[13:02:05] stephanebisson: can you please paste the commands that need to be run to the calendar?
[13:02:07] zeljkof: they should be
[13:02:29] (CR) Alexandros Kosiaris: [C: -1] ircecho: Convert to base::service class to maintain the script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[13:02:58] (PS10) Paladox: ircecho: Convert to base::service class to maintain the script [puppet] - https://gerrit.wikimedia.org/r/347518
[13:03:01] (CR) Paladox: ircecho: Convert to base::service class to maintain the script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[13:03:08] stephanebisson: will ping you when your patch is at mwdebug1002, I will deploy Urbanecm's patches first
[13:03:20] zeljkof: ok
[13:03:58] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) (owner: Urbanecm)
[13:05:36] (Merged) jenkins-bot: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) (owner: Urbanecm)
[13:06:18] (CR) Alexandros Kosiaris: [C: 1] "running a pcc job and will merge after that. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[13:06:36] (CR) jenkins-bot: Give sysops ability to promote users to eliminator at fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/347166 (https://phabricator.wikimedia.org/T162396) (owner: Urbanecm)
[13:06:39] (CR) Paladox: "> running a pcc job and will merge after that. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/347518 (owner: Paladox)
[13:07:24] zeljkof: I've added the command to the deployment page
[13:07:30] stephanebisson: thanks!
[13:07:54] Urbanecm: 347166 is at mwdebug1002, please test and let me know when I can deploy to cluster
[13:08:28] (PS4) Gehel: [WIP] maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006
[13:08:39] (CR) jerkins-bot: [V: -1] [WIP] maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006 (owner: Gehel)
[13:09:00] zeljkof: please deploy
[13:09:25] Urbanecm: deploying...
[13:09:32] zeljkof: ack
[13:10:32] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347166|Give sysops ability to promote users to eliminator at fawiki (T162396)]] (duration: 00m 39s)
[13:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:39] T162396: Giving sysops ability to give Eliminator access at fawiki - https://phabricator.wikimedia.org/T162396
[13:11:17] hashar: just got a scary scap message
[13:11:37] zeljkof: paste?
[13:11:53] pasting...
[13:12:00] though I can find the logs in logstash :D
[13:12:48] https://phabricator.wikimedia.org/P5239
[13:13:00] looks like a machine has different key
[13:13:12] not sure how to make sure
[13:13:56] Offending ECDSA key in /etc/ssh/ssh_known_hosts:1018
[13:13:58] remove with: ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R mw2246.codfw.wmnet
[13:14:00] guess that machine got reimaged at some point
[13:14:11] and puppet hasn't run yet on tin.eqiad.wmnet to regenerate the ssh_known_hosts file
[13:14:29] 12:47 reimage mw2246 (Debian codfw videoscaler) to Trusty
[13:14:36] zeljkof: ^^^ yeah host has been reimaged
[13:14:42] so you can essentially ignore it
[13:14:47] (PS5) Gehel: [WIP] maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006
[13:15:05] hashar: ok, thanks, so I just continue scap as usual, and ignore the error message?
[13:15:23] yeah
[13:15:54] hashar: thanks, I always get scared when I see "IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!"
:)
[13:15:55] zeljkof, hashar: I don't think ssh would allow to connect when key is offending, just thinking loudly
[13:15:56] zeljkof: sorry my bad
[13:16:07] I forgot to remove it from dsh
[13:16:08] yeah that is the message ssh yields about
[13:16:16] it refuses to connect because the target host is not known
[13:16:23] elukey: can you fix it?
[13:16:29] but that would be fixed magically automatically at some point
[13:16:33] just ignore it for now :}
[13:16:40] hashar: ok, swat continues
[13:16:57] Urbanecm: 347166 is deployed to cluster
[13:17:00] zeljkof: ack
[13:17:00] zeljkof: yep fixing in 1 min
[13:17:03] continuing with the second patch
[13:17:08] elukey: thanks!
[13:17:11] I am running puppet on tin
[13:17:14] to update the dsh
[13:17:21] if you wait 1 sec it should be done
[13:17:46] I need a minute or two for the second patch anyway
[13:17:47] zeljkof: done!
[13:17:55] sorry again
[13:17:57] elukey: that was quick, thanks! :)
[13:18:01] no problem
[13:18:23] (CR) jerkins-bot: [V: -1] [WIP] maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006 (owner: Gehel)
[13:18:49] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:19:16] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[13:20:13] (PS6) Gehel: [WIP] maps - move to role / profile [puppet] - https://gerrit.wikimedia.org/r/347006
[13:20:39] (PS3) Zfilipin: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:20:55] (PS1) Ottomata: Move hadooop python package declaration into client [puppet] - https://gerrit.wikimedia.org/r/347606
[13:21:49] (PS14) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] -
https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[13:22:46] (CR) Zfilipin: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:22:53] (CR) Zfilipin: [C: 2] Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:23:57] (Merged) jenkins-bot: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:24:03] (CR) Ottomata: [C: 2] Move hadooop python package declaration into client [puppet] - https://gerrit.wikimedia.org/r/347606 (owner: Ottomata)
[13:24:11] (CR) jenkins-bot: Increase default image thumbnail size on Finnish Wikipedia to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/347436 (https://phabricator.wikimedia.org/T162376) (owner: Urbanecm)
[13:24:23] (PS3) Zfilipin: Add autopatrolled group to svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: Urbanecm)
[13:25:36] Urbanecm: 347436 is at mwdebug1002
[13:26:38] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: Urbanecm)
[13:26:56] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[13:27:04] (PS1) Ottomata: Revert "Move hadooop python package declaration into client" [puppet] - https://gerrit.wikimedia.org/r/347607
[13:27:12] (CR) Ottomata: [V: 2 C: 2] Revert "Move hadooop python package declaration into client" [puppet] - https://gerrit.wikimedia.org/r/347607 (owner: Ottomata)
[13:27:37] (CR) Zfilipin: Add autopatrolled group to svwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: Urbanecm)
[13:27:43] (PS2) Ottomata: Revert "Move hadooop python package declaration into client" [puppet] - https://gerrit.wikimedia.org/r/347607
[13:27:52] (CR) Ottomata: [V: 2 C: 2] Revert "Move hadooop python package declaration into client" [puppet] - https://gerrit.wikimedia.org/r/347607 (owner: Ottomata)
[13:28:56] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python3-mmh3],Package[python3-tabulate],Package[python3-nltk]
[13:29:37] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet]
[13:30:23] Urbanecm: can I deploy 347436 to cluster? still testing?
[13:30:46] zeljkof: no, I oversaw your message so I didn't started the testing. Sorry and thank you for the ping!
[13:32:08] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet]
[13:32:18] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:32:54] zeljkof: working [13:33:17] Urbanecm: deploying [13:33:23] zeljkof: ack [13:34:30] (03PS1) 10Ottomata: Install jessie by default on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/347608 [13:34:42] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347436|Increase default image thumbnail size on Finnish Wikipedia to 250px (T162376)]] (duration: 00m 39s) [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:48] T162376: Increase default image thumbnail size on Finnish Wikipedia to 250px - https://phabricator.wikimedia.org/T162376 [13:34:50] Urbanecm: deployed, please test [13:34:59] zeljkof: working on it [13:35:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: 10Urbanecm) [13:35:28] zeljkof: working [13:36:06] (03Merged) 10jenkins-bot: Add autopatrolled group to svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: 10Urbanecm) [13:36:30] (03CR) 10jenkins-bot: Add autopatrolled group to svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919) (owner: 10Urbanecm) [13:37:46] Urbanecm: 345965 is at mwdebug1002 [13:38:07] (03PS7) 10Zfilipin: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope) [13:38:26] zeljkof: working [13:38:37] Urbanecm: deploying [13:39:30] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:345965|Add autopatrolled group to svwiktionary (T161919)]] (duration: 00m 39s) [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:37] T161919: Please enable the autopatrol group for the 
Swedish Wiktionary - https://phabricator.wikimedia.org/T161919 [13:39:52] Urbanecm: the last patch is deployed to cluster, please test and thanks for deploying with #releng :) [13:40:05] stephanebisson: your patch is next [13:40:30] (03PS5) 10Alexandros Kosiaris: Make backup::set effectively a virtual resource [puppet] - 10https://gerrit.wikimedia.org/r/347388 [13:40:37] zeljkof: working and thank you for the deploys! [13:40:39] (03CR) 10Alexandros Kosiaris: [C: 032] Make backup::set effectively a virtual resource [puppet] - 10https://gerrit.wikimedia.org/r/347388 (owner: 10Alexandros Kosiaris) [13:40:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make backup::set effectively a virtual resource [puppet] - 10https://gerrit.wikimedia.org/r/347388 (owner: 10Alexandros Kosiaris) [13:41:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope) [13:42:18] (03Merged) 10jenkins-bot: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope) [13:42:30] (03CR) 10jenkins-bot: Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope) [13:43:40] stephanebisson: 343438 is at mwdebug1002, please test and let me know if I can deploy to cluster [13:43:50] zeljkof: testing... [13:44:00] stephanebisson: I did not run the scripts yet, should I do that now, or after deploying to cluster? 
[13:44:22] zeljkof: it can be done at the same time [13:45:22] !log Updating all Jenkins jobs using the git plugin due to JJB change cdfeb7bf66b0eacfed3eaf2a77813d65ab0e29f2 - https://phabricator.wikimedia.org/T162674 [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:58] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:47:59] zeljkof: looks good [13:48:24] stephanebisson: ok, should I deploy to cluster first, or run the scripts first? [13:48:44] zeljkof: the script, probably [13:49:18] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [13:49:20] !log roll-upgrade swift to 2.2.0 across eqiad machines - T162609 [13:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:27] T162609: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609 [13:49:31] stephanebisson: ok, I seldom run scripts, this is what I need to do? [13:49:38] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:49:44] stephanebisson: zfilipin@terbium:~$ mwscript initUserPreference.php -s ores-enabled -t oresHighlight [13:49:48] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:49:51] (03CR) 10Ottomata: [C: 032] Install jessie by default on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/347608 (owner: 10Ottomata) [13:49:53] how do I specify which wikis to run it for? 
[13:49:56] (03PS2) 10Ottomata: Install jessie by default on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/347608 [13:50:00] (03CR) 10Ottomata: [V: 032 C: 032] Install jessie by default on stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/347608 (owner: 10Ottomata) [13:50:04] zeljkof: you also need --wiki fawiki [13:50:08] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:10] zeljkof: mwscript foobar.php --wiki fawiki [13:50:18] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:50:38] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [13:51:02] stephanebisson, hashar: thanks, so: zfilipin@terbium:~$ mwscript initUserPreference.php -s ores-enabled -t oresHighlight --wiki fawiki [13:51:08] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [13:51:12] !log upgrade puppet agent to 3.8 across the jessie fleet. Do that in a stages, starting with parsoid hosts [13:51:14] and then do the same for ruwiki and trwiki? [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:20] zeljkof: yes [13:51:31] stephanebisson: ok, doing [13:51:32] zeljkof: --wiki has to be the first parameter iirc [13:51:33] so [13:51:44] hashar: oh, thanks [13:51:47] mwscript initUserPreference.php --wiki fawiki REST OF PARAMS [13:52:01] good to know [13:52:01] (03PS1) 10Gehel: maps - add new dummy passwords to follow refactoring to role / profile [labs/private] - 10https://gerrit.wikimedia.org/r/347610 [13:52:08] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:52:16] hashar, stephanebisson: so [13:52:20] zfilipin@terbium:~$ mwscript initUserPreference.php --wiki fawiki -s ores-enabled -t oresHighlight [13:52:29] zeljkof: yes [13:53:08] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:53:21] !log upgrade puppet agent to 3.8 across the jessie fleet. Do that in a stages, starting with parsoid hosts. move on to mw fleet next. T162462 [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:27] T162462: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462 [13:53:55] (03CR) 10Gehel: [C: 032] maps - add new dummy passwords to follow refactoring to role / profile [labs/private] - 10https://gerrit.wikimedia.org/r/347610 (owner: 10Gehel) [13:53:58] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:54:01] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3171422 (10Aklapper) The three translations by @Janak_bhatta and @Rameshti seem to differ - could you agree on the translations? [13:54:01] !log reimaging stat1004 as jessie [13:54:03] (03CR) 10Gehel: [V: 032 C: 032] maps - add new dummy passwords to follow refactoring to role / profile [labs/private] - 10https://gerrit.wikimedia.org/r/347610 (owner: 10Gehel) [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:09] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[puppet] [13:54:10] (03CR) 10WMDE-Fisch: [C: 031] Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370) (owner: 10Addshore) [13:54:35] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage the Hadoop Cluster to Debian Jessie - https://phabricator.wikimedia.org/T160333#3171426 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['stat1004.eqiad.wmnet'] ``` The... [13:54:39] (03CR) 10Zfilipin: "script output:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 (https://phabricator.wikimedia.org/T144458) (owner: 10Catrope) [13:54:57] stephanebisson: no problems with scripts ^ [13:55:00] deploying to cluster [13:55:09] great [13:55:38] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [13:55:56] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:343438|Enable RCFilters beta feature on fawiki, ruwiki, trwiki, and frwiki (T144458)]] (duration: 00m 39s) [13:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:03] T144458: Launch ERI RC page features as a Beta Feature to all wikis - https://phabricator.wikimedia.org/T144458 [13:56:26] stephanebisson: deployed, please check production and thanks for deploying with #releng :) [13:56:28] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:56:38] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:56:39] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:57:08] zeljkof: Thanks! 
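Note on the mwscript exchange above (13:49:44 to 13:52:29): the command agreed on there puts `--wiki` directly after the script name, ahead of the script's own `-s`/`-t` options, and is repeated for ruwiki and trwiki. A minimal dry-run sketch of that loop follows. It only prints the commands and does not run them. The helper name `build_mwscript_cmd` is hypothetical and not something used in the channel, and whether `--wiki` is strictly required to come first was stated only as "iirc".

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one mwscript command per wiki.
# Assumption: --wiki goes right after the script name, before the script's
# own options, as suggested in the channel ("iirc").

# build_mwscript_cmd is a hypothetical helper name used for illustration.
build_mwscript_cmd() {
    # $1 = wiki database name, e.g. fawiki
    printf 'mwscript initUserPreference.php --wiki %s -s ores-enabled -t oresHighlight\n' "$1"
}

# The three wikis named in the exchange above.
for wiki in fawiki ruwiki trwiki; do
    build_mwscript_cmd "$wiki"
done
```

Printing the commands first makes it possible to review the argument order before pasting anything into a maintenance host shell.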
[13:57:15] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:59:15] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:59:45] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [14:00:15] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:16] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:00:28] ottomata: is this yours? ^^^ unmerged puppet change [14:00:35] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [14:02:35] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:04:27] (03PS1) 10Alexandros Kosiaris: Shortcircuit profile::backup::host to be usable always [puppet] - 10https://gerrit.wikimedia.org/r/347612 [14:06:25] (03PS2) 10Alexandros Kosiaris: Shortcircuit profile::backup::host to be usable always [puppet] - 10https://gerrit.wikimedia.org/r/347612 [14:06:35] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:07:45] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:08:35] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:08:53] <_joe_> that is bogus ^^ on mc1001 [14:09:05] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:09] volans: merging the change, we are reimaging and I think we'll need to do it again :( [14:09:23] <_joe_> did we change something somewhere? [14:09:33] (03PS3) 10Jcrespo: Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) [14:09:45] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [14:10:25] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3171462 (10MoritzMuehlenhoff) These are fully rolled out: e2fsprogs ca-certificates pam [14:10:33] <_joe_> akosiaris: are you updating puppet across the fleet? [14:10:45] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:10:54] <_joe_> message: "Could not autoload puppet/util/instrumentation/listeners/log: Class Log is already defined in Puppet::Util::Instrumentation" [14:11:06] (03PS1) 10BBlack: traffic: route esams via codfw [puppet] - 10https://gerrit.wikimedia.org/r/347613 [14:11:16] should I stop merging puppet? 
[14:11:34] (03PS1) 10Giuseppe Lavagetto: etcd::client::config: properly handle settings [puppet] - 10https://gerrit.wikimedia.org/r/347614 [14:11:35] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:12:05] (03PS1) 10BBlack: traffic: depool eqiad from user traffic [dns] - 10https://gerrit.wikimedia.org/r/347616 [14:13:29] (03PS7) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [14:14:15] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:15:35] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:17:05] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:17:15] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:17:15] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [14:17:18] (03CR) 10Jcrespo: [C: 032] Indicate install recipes for newest db1* and db2* DB servers [puppet] - 10https://gerrit.wikimedia.org/r/346580 (https://phabricator.wikimedia.org/T162159) (owner: 10Jcrespo) [14:17:45] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:17:45] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:18:00] !log upgrading restbase1007 to Linux 4.9 [14:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:15] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:18:35] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 
failures [14:18:45] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:18:53] (03PS3) 10Andrew Bogott: slapd conf: Allow for unlimited paged searches [puppet] - 10https://gerrit.wikimedia.org/r/346790 [14:19:05] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:19:35] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:19:51] (03PS8) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [14:20:35] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:20:41] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3171490 (10hashar) [14:21:15] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:22:16] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:24:19] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:24:35] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:25:35] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:25:45] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[14:27:35] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:28:15] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:29:33] Hi! [14:29:48] (03PS9) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [14:29:55] 06Operations, 10Cassandra, 06Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3171494 (10Eevans) 05Resolved>03Open >>! In T162614#3170616, @elukey wrote: > @Eevans thansk a lot for the details, I had no idea that these manual steps sho... [14:30:10] Could anyone please tell me how to ask for my sysop flag to be removed? [14:30:35] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:30:35] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:31:08] !log powercycled restbase1007, stuck during reboot [14:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:24] 06Operations, 10Cassandra, 06Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3171504 (10Eevans) >>! In T162614#3171494, @Eevans wrote: >>>! In T162614#3170616, @elukey wrote: [ ... ] >> Maybe worth to check partman's recipe and/or to up... [14:36:35] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:37:45] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:38:35] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:38:38] (03PS2) 10Giuseppe Lavagetto: role::puppet_compiler: fix conftool config [puppet] - 10https://gerrit.wikimedia.org/r/347614 [14:39:05] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:45] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:41:08] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppet_compiler: fix conftool config [puppet] - 10https://gerrit.wikimedia.org/r/347614 (owner: 10Giuseppe Lavagetto) [14:41:14] (03PS3) 10Giuseppe Lavagetto: role::puppet_compiler: fix conftool config [puppet] - 10https://gerrit.wikimedia.org/r/347614 [14:43:53] elukey: Not for Wikidata [14:44:21] Is there anyone who I can talk to regarding this in Ops? [14:45:05] I think that _joe_, akosiaris and paravoid might be good point of contacts [14:45:35] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:45:37] Amir1: is there any task or documentation that explain in more detail the issue? [14:45:38] (for the context, we are talking about this: https://gerrit.wikimedia.org/r/#/c/347395/) [14:45:43] yeah [14:45:51] they are scattered in several places [14:46:08] <_joe_> Amir1: that is not gonna happen before we did the switchover [14:46:08] https://phabricator.wikimedia.org/T151681 [14:46:32] <_joe_> Amir1: also, I need to understand what the impact in terms of requests that will use the lockmanager compared to before [14:46:56] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3171530 (10MoritzMuehlenhoff) 05Open>03Resolved Marking this one as fixed, the dead lock hasn't reoccured. The rollout of the new packages will be hand... 
[14:46:58] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3171532 (10MoritzMuehlenhoff) [14:47:24] jouncebot: next [14:47:24] In 1 hour(s) and 12 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1600) [14:47:38] _joe_: yeah. The number of locks won't be much probably ten (or twenty at the most) locks will be stored in redis [14:48:00] with the time out of two hours [14:48:17] there already working on test wikidata [14:48:23] *they are [14:48:27] <_joe_> yeah saw that [14:48:35] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3171536 (10Cmjohnson) @Marostegui | db1096 |a1 **(no available u space — pick another location)** | db1097|d1 **(No issues)** | db1098|a2 **(Will definitely need a decom se... [14:48:48] <_joe_> if that's really the number I expect that patch to be easily mergeable after the switchover freeze [14:49:11] okay [14:49:29] Thanks. When is the freeze finished? [14:49:31] to be clear, I think _joe_ means May 3rd rather than Apr 26th [14:49:32] 20th? [14:49:35] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:49:39] okay [14:49:43] er, 24th sorry [14:49:52] the week of Apr 24th is not frozen [14:50:01] okay, that makes sense [14:50:11] but I'd still avoid making big changes before the May 3rd switchback [14:51:31] It's understandable [14:56:58] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3171602 (10Marostegui) Thanks @Cmjohnson, what about these changes: ``` db1096 - a6 db1098 - b5 db1099 - d3 ``` [14:58:35] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:00:18] 06Operations, 06Labs: Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834#3171608 (10madhuvishy) > This should really be done in two parts: > > - refactoring so that the paths used in tools for the share links are common to the rest of the projects Th... [15:00:35] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:01:05] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:02:18] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3171633 (10jcrespo) [15:02:23] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3171634 (10jcrespo) [15:02:45] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [15:03:36] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3022062 keys, up 18 days 22 hours - replication_delay is 0 [15:05:33] !log upgrade cp4005 (cache_upload) to linux 4.9 T162029 [15:05:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [15:05:45] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:06:35] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:31] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3171675 (10RobH) >>! In T159838#3171078, @faidon wrote: > @Ottomata @RobH this seems to have been stalled somewhere between you two. Could you guys figure this and an... [15:07:39] 11 Apr 14:57:40.497 # I/O error trying to sync with MASTER: connection lost [15:07:45] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:48] this is from rdb2005:6479 [15:08:35] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:00] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171683 (10jcrespo) [15:09:02] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3171696 (10Paladox) Workaround is download the debs from https://packages.debian.org/jessie/all/puppet/download [15:09:05] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:29] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171683 (10jcrespo) Not for dc ops yet. 
[15:10:38] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171704 (10jcrespo) [15:10:45] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:11:34] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171683 (10jcrespo) [15:12:21] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171709 (10Marostegui) [15:12:34] 06Operations, 10DBA: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3171710 (10jcrespo) [15:12:54] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3171712 (10jcrespo) [15:13:21] (03PS6) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [15:14:10] (03CR) 10jerkins-bot: [V: 04-1] tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [15:15:35] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:18:59] 06Operations, 06Labs: Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834#3171795 (10chasemp) By `refactoring so that the paths used in tools for the share links are common to the rest of the projects` I meant this :) > - Currently the mount paths fo... 
[15:19:04] !log upload scap 3.5.5-1 - T127762 [15:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:11] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [15:19:35] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:19:37] (03PS3) 10Filippo Giunchedi: Scap: update version to 3.5.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [15:19:47] jouncebot: next [15:19:47] In 0 hour(s) and 40 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1600) [15:20:20] * thcipriani watches [15:21:09] (03CR) 10Filippo Giunchedi: [C: 032] Scap: update version to 3.5.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/346579 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [15:21:09] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3171812 (10Cmjohnson) I need to check b5, it's a 24pt switch not 48. I believe there is 1 more available 1G port. [15:22:14] thcipriani: ^ merged [15:22:38] godog: awesome! Thank you! I'll check tin once puppet runs there [15:23:30] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3171835 (10Marostegui) >>! In T162233#3171812, @Cmjohnson wrote: > I need to check b5, it's a 24pt switch not 48. I believe there is 1 more available 1G port. If not, we ca... 
[15:23:54] thcipriani: yup I just kicked a puppet run there, scap is upgraded [15:24:00] * thcipriani checks [15:25:55] !log thcipriani@tin Synchronized README: test sync for new scap version 3.5.5 (duration: 00m 59s) [15:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:56] godog: all looks good, thanks for the update, I appreciate your help as always [15:27:32] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:27:37] thcipriani: no worries! glad it worked as expected [15:29:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:30:32] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:31:37] 06Operations, 10ops-codfw, 06Performance-Team, 15User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3171888 (10fgiunchedi) a:05fgiunchedi>03Papaul Thanks @papaul ! I've copied the coal data off the usb drive, you can unplug it. I suppose once... [15:32:02] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:32:10] (03CR) 10Giuseppe Lavagetto: Properly handle inserting menu items with an explicit index (032 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/347573 (owner: 10Giuseppe Lavagetto) [15:33:38] (03PS10) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [15:34:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:36:33] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:36:52] (03PS3) 10Giuseppe Lavagetto: Properly handle inserting menu items with an explicit index [switchdc] - 10https://gerrit.wikimedia.org/r/347573 [15:36:54] (03PS3) 10Giuseppe Lavagetto: Suppress the SAN warnings for confctl [switchdc] - 10https://gerrit.wikimedia.org/r/347593 [15:37:37] <_joe_> volans: ^^ should be ok now? [15:37:42] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:01] _joe_: looking [15:38:33] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:57] <_joe_> akosiaris: ^^ it seems these are puppet runs on 3.7, but 3.8 is now installed [15:39:02] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:21] <_joe_> akosiaris: I do see root 19552 1 0 Apr05 ? 
00:00:12 /usr/bin/ruby /usr/bin/puppet agent 0tv [15:39:32] <_joe_> that looks like someone did a typo [15:40:04] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/347573 (owner: 10Giuseppe Lavagetto) [15:40:09] (03PS4) 10Volans: Properly handle inserting menu items with an explicit index [switchdc] - 10https://gerrit.wikimedia.org/r/347573 (owner: 10Giuseppe Lavagetto) [15:40:15] just rebased [15:40:42] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:43:06] (03PS1) 10Filippo Giunchedi: hieradata: allow codfw prometheus to talk to netmon eqiad [puppet] - 10https://gerrit.wikimedia.org/r/347622 (https://phabricator.wikimedia.org/T148541) [15:44:27] <_joe_> akosiaris: I also found all the hosts where that happened, if you're interested [15:45:33] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:47:04] (03PS2) 10Filippo Giunchedi: hieradata: allow codfw prometheus to talk to netmon eqiad [puppet] - 10https://gerrit.wikimedia.org/r/347622 (https://phabricator.wikimedia.org/T148541) [15:49:32] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:42] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:05] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: allow codfw prometheus to talk to netmon eqiad [puppet] - 10https://gerrit.wikimedia.org/r/347622 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:51:45] (03CR) 10Paladox: "@Alexandros Kosiaris did you run ppc?" 
[puppet] - 10https://gerrit.wikimedia.org/r/347518 (owner: 10Paladox) [15:52:09] (03PS9) 10Mobrovac: RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) [15:53:40] !log testing the codfw caches wipe+warm: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_4.1_-_Wipe_caches T160178 [15:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178 [15:54:21] !log switchdc (volans@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [15:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:32] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3171998 (10ayounsi) Talked to Chris and Brandon, we're going to aim for doing the work on Wednesday April 26. [15:56:56] !log switchdc (volans@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Failed to execute [15:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:08] great! [15:58:42] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1600). 
[16:00:32] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:01:02] (03PS5) 10Giuseppe Lavagetto: Properly handle inserting menu items with an explicit index [switchdc] - 10https://gerrit.wikimedia.org/r/347573 [16:01:02] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:01:04] (03PS4) 10Giuseppe Lavagetto: Suppress the SAN warnings for confctl [switchdc] - 10https://gerrit.wikimedia.org/r/347593 [16:01:06] (03PS1) 10Giuseppe Lavagetto: Fix path of script in cache_wipe [switchdc] - 10https://gerrit.wikimedia.org/r/347624 [16:01:11] no patches for puppet swat [16:01:30] (03CR) 10Chad: [C: 032] Scap clean: Log to IRC when we prune a branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347061 (owner: 10Chad) [16:01:43] (03PS2) 10Giuseppe Lavagetto: Fix path of script in cache_wipe [switchdc] - 10https://gerrit.wikimedia.org/r/347624 [16:02:46] (03Merged) 10jenkins-bot: Scap clean: Log to IRC when we prune a branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347061 (owner: 10Chad) [16:02:58] (03CR) 10jenkins-bot: Scap clean: Log to IRC when we prune a branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347061 (owner: 10Chad) [16:03:08] (03PS3) 10Giuseppe Lavagetto: Fix path of script in cache_wipe [switchdc] - 10https://gerrit.wikimedia.org/r/347624 [16:03:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Properly handle inserting menu items with an explicit index [switchdc] - 10https://gerrit.wikimedia.org/r/347573 (owner: 10Giuseppe Lavagetto) [16:03:51] (03CR) 10Giuseppe Lavagetto: [C: 032] Suppress the SAN warnings for confctl (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/347593 (owner: 10Giuseppe Lavagetto) [16:04:34] !log demon@tin Synchronized scap/plugins/clean.py: syncing to both masters (duration: 00m 44s) [16:04:40] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 
10https://gerrit.wikimedia.org/r/347624 (owner: 10Giuseppe Lavagetto) [16:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:19] (03PS1) 10ArielGlenn: updated for support up through MW 1.29 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347625 [16:05:21] (03PS1) 10ArielGlenn: add a sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626 [16:05:34] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix path of script in cache_wipe [switchdc] - 10https://gerrit.wikimedia.org/r/347624 (owner: 10Giuseppe Lavagetto) [16:05:49] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix path of script in cache_wipe [switchdc] - 10https://gerrit.wikimedia.org/r/347624 (owner: 10Giuseppe Lavagetto) [16:06:32] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:07:23] (03PS1) 10ArielGlenn: last page range for page content job would sometimes have too many revs [dumps] - 10https://gerrit.wikimedia.org/r/347627 [16:08:15] !log testing the codfw caches wipe+warm, take 2 [16:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:22] (03CR) 10jerkins-bot: [V: 04-1] updated for support up through MW 1.29 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347625 (owner: 10ArielGlenn) [16:08:28] (03CR) 10jerkins-bot: [V: 04-1] add a sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626 (owner: 10ArielGlenn) [16:08:30] (03PS11) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 [16:08:31] !log switchdc (volans@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [16:08:33] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:02] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:10:33] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:11:06] !log switchdc (volans@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Successfully completed [16:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:32] jynus: what I mean: If he needs to take an action at that task (e.g. modifying his scripts) why don't you assign the task to him? [16:13:24] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3172067 (10RobH) a:03RobH I'm working on quotes in the #procurement S4 space for this request, since March 29th. [16:13:28] it is rude to do that [16:13:35] in my own opinion [16:13:47] 06Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 06Labs, 10hardware-requests: Eqiad: Hardware request for labstore1006/7, dataset1002/3 - https://phabricator.wikimedia.org/T161311#3128341 (10RobH) 05Open>03stalled a:03RobH I'm working on quotes in the #procurement S4 space for this... [16:13:58] _joe_: yeah I did the puppet upgrade across all jessies. 
See https://phabricator.wikimedia.org/T162462 [16:14:01] some people only get assigned things they are actually working on [16:14:19] but the puppet agent run you pasted is clearly a typo by someone [16:14:20] so I think pinging is the right way [16:14:35] in any case, Sagan, the issue has been already solved [16:14:56] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3172075 (10Papaul) [16:15:01] ah, ok [16:15:16] <_joe_> akosiaris: yeah it was on two hosts, fixed myself [16:15:33] _joe_: I am starting to love cumin even more [16:15:42] it was actually quite fun using it for that [16:15:46] <_joe_> akosiaris: you'll see, it's addictive [16:15:57] ahahah [16:16:13] (03PS1) 10Giuseppe Lavagetto: Correct parameter order for cache warmup script [switchdc] - 10https://gerrit.wikimedia.org/r/347629 [16:16:15] <_joe_> volans: ^^ [16:16:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Correct parameter order for cache warmup script [switchdc] - 10https://gerrit.wikimedia.org/r/347629 (owner: 10Giuseppe Lavagetto) [16:16:43] ok [16:16:45] but still [16:16:50] I'd like the command to fail [16:16:56] <_joe_> volans: yeah w/e [16:17:02] <_joe_> it's ok [16:17:04] <_joe_> for now [16:17:06] given that we have the catch in switchdc [16:17:21] <_joe_> we can fix that later :) [16:17:25] (03PS1) 10Filippo Giunchedi: Decommission prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/347630 (https://phabricator.wikimedia.org/T162712) [16:17:32] <_joe_> I agree, of course [16:17:32] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:17:50] yeah : [16:17:51] :) [16:18:53] (03PS11) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [16:19:38] phab is broken [16:19:42] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:20:39] paravoid: works for me [16:20:44] for me too [16:20:51] paravoid: what error do you get? [16:22:51] from another channel, seems to be on his end :) [16:22:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (owner: 10Gehel) [16:23:11] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3172109 (10Dzahn) I read the instructions but i have questions: "remove the host from the round-robin DNS name specified in the Collection extension configuration, so it is no longer the target of new job requests... [16:23:17] (03CR) 10Filippo Giunchedi: [C: 032] Decommission prometheus[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/347630 (https://phabricator.wikimedia.org/T162712) (owner: 10Filippo Giunchedi) [16:24:46] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [16:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:00] (03PS1) 10Chad: Scap clean: exclude .git directories on first pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347633 [16:25:28] (03CR) 10Mobrovac: [C: 031] "Ok, this is now good to go. 
Cherry-picked and tested in BetaCluster" [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:27:54] !log demon@tin Synchronized README: no-op, co-master sync (duration: 00m 43s) [16:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:42] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:28:51] !log restbase disabling puppet for T116335 [16:28:53] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:28:58] (03PS10) 10Alexandros Kosiaris: RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:58] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [16:29:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/347452 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:30:18] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3172165 (10akosiaris) The upgrade went fine on all the jessie hosts, now looking into how easy is to do trusty as well. 
[16:30:32] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:30:47] 06Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3172171 (10MoritzMuehlenhoff) The update of Pango itself doesn't help on its own: I've generated a PNG with stock jessie and an updated Pango... [16:31:07] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:31:43] (03CR) 10Chad: [C: 032] donatewiki back to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347056 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [16:32:32] !log mobrovac@tin Started deploy [restbase/deploy@e470b9f]: Dev Cluster: Initial Scap3 config deploy - T116335 [16:32:33] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Successfully completed [16:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347634 [16:33:36] !log mobrovac@tin Finished deploy [restbase/deploy@e470b9f]: Dev Cluster: Initial Scap3 config deploy - T116335 (duration: 01m 04s) [16:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:44] mobrovac: success? 
[16:34:38] greg-g: partial :) [16:34:40] working on it [16:34:42] (03Merged) 10jenkins-bot: donatewiki back to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347056 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [16:34:42] :) [16:34:54] (03CR) 10jenkins-bot: donatewiki back to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347056 (https://phabricator.wikimedia.org/T162300) (owner: 10Chad) [16:35:29] awight|unwork: Ok, here goes [16:35:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347634 (owner: 10Marostegui) [16:35:47] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: donatewiki back to wmf.19 [16:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] " tags cannot be used outside of normal pages. tags cannot be used outside of normal pages.From Wikipedia founder Jimmy Wales tags cannot be used outside of normal pages.//upload.wikimedia.org/wikipedia/donate/e/eb/Wikipedia-logo-Donate2.png tags cannot be used outside of normal pages." [16:36:26] Blah, nobody fixed anything [16:36:37] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:36:47] Rolling back [16:37:18] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: donatewiki still busted [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:41] Wait, I see the same on wmf.18 too [16:37:46] Hmm, what? [16:37:50] !log ppchelko@tin Started deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001 T160764 [16:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:57] T160764: Update electron render service - https://phabricator.wikimedia.org/T160764 [16:38:16] (03CR) 10Alexandros Kosiaris: [C: 031] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [16:38:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347634 (owner: 10Marostegui) [16:38:29] RainbowSprinkles: hahaha. well meanwhile donatewiki didn't melt [16:38:37] (03PS12) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [16:38:37] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347634 (owner: 10Marostegui) [16:39:07] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:39:23] someone doing something at tin on mediawiki-staging? [16:39:31] I would need to deploy db-eqiad.php [16:40:07] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused [16:40:46] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t05_switch_traffic(eqiad, codfw) Switch traffic flow to the appservers in the new datacenter [16:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:18] there are uncommited changes in mediawiki-staging [16:41:25] modified: wikiversions.json [16:41:34] marostegui: RainbowSprinkles is [16:41:39] !log mobrovac@tin Started deploy [restbase/deploy@e470b9f]: Staging: Initial Scap3 config deploy - T116335 [16:41:41] see logs from demon above [16:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:47] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [16:42:04] oooh yay scap3 getting wider usage [16:42:15] !log ppchelko@tin Finished deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001 T160764 (duration: 04m 28s) 
[16:42:21] greg-g: thanks, we are kind in a hurry to depool a database [16:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:34] apergos: the service we designed the tool requirements for is now migrating ;) [16:43:00] ugh, we need to de-couple that finally (db pooling in mw-config) [16:43:07] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.075 second response time [16:43:13] !log mobrovac@tin Finished deploy [restbase/deploy@e470b9f]: Staging: Initial Scap3 config deploy - T116335 (duration: 01m 33s) [16:43:19] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t05_switch_traffic(eqiad, codfw) Successfully completed [16:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:25] RainbowSprinkles: marostegui needs to depool a db... [16:43:34] Tossed my change, go ahead [16:43:45] 06Operations, 10Electron-PDFs, 06Services, 13Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10Pchelolo) The same thing has just happened when I've tried to update the service to a newer version (see T160764). Will... [16:43:48] (fucking around with a maybe-but-maybe-not busted wiki) [16:44:10] oh yes we do, greg-g, I would love it a lot [16:44:12] thanks [16:45:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 (duration: 00m 42s) [16:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:37] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:45:42] godog: joe sorry for being late for the puppet SWAT, but could you please still merge a simple one for me? 
https://gerrit.wikimedia.org/r/#/c/341833/ [16:46:02] marostegui: For future reference, if you didn't get a response to your ping is to `git stash` the uncommitted work, then proceed :) [16:46:18] RainbowSprinkles: Will do, thanks! :) [16:46:55] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for frbackup2001 [dns] - 10https://gerrit.wikimedia.org/r/347638 [16:46:58] I forgot a few words. Something about that being best practice :p [16:47:00] (03CR) 10Alexandros Kosiaris: [C: 032] Shortcircuit profile::backup::host to be usable always [puppet] - 10https://gerrit.wikimedia.org/r/347612 (owner: 10Alexandros Kosiaris) [16:47:06] (03PS3) 10Alexandros Kosiaris: Shortcircuit profile::backup::host to be usable always [puppet] - 10https://gerrit.wikimedia.org/r/347612 [16:47:07] Pchelolo: yeah I'll take a look [16:47:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Shortcircuit profile::backup::host to be usable always [puppet] - 10https://gerrit.wikimedia.org/r/347612 (owner: 10Alexandros Kosiaris) [16:48:00] thank you godog. It's been on the SWAT before, but seems it wasn't merged at that time [16:48:02] paladox: that one ^ should solve the issue you have in labs with profile::backup::host [16:48:10] paladox: could you please confirm ? [16:48:12] (03PS3) 10Filippo Giunchedi: PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [16:48:15] Ok thanks [16:48:18] will test now [16:48:36] !log Deploy unscheduled alter table on db1093 (adding pl_from index) [16:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:17] (03CR) 10Alexandros Kosiaris: [C: 032] "Yes. And stumbled across a bug in PCC. Which we need to fix. I think I 'll just merge this in order to not block on it." 
[puppet] - 10https://gerrit.wikimedia.org/r/347518 (owner: 10Paladox) [16:49:22] (03PS11) 10Alexandros Kosiaris: ircecho: Convert to base::service class to maintain the script [puppet] - 10https://gerrit.wikimedia.org/r/347518 (owner: 10Paladox) [16:49:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ircecho: Convert to base::service class to maintain the script [puppet] - 10https://gerrit.wikimedia.org/r/347518 (owner: 10Paladox) [16:49:29] (03CR) 10Filippo Giunchedi: [C: 032] PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [16:49:36] (03PS4) 10Filippo Giunchedi: PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [16:49:44] merge wars! [16:49:47] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:54] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [16:50:15] akosiaris thanks and puppet seems to work on gerrit-test3. Not sure if it is caching though. [16:51:13] Pchelolo: {{done}} I guess nobody was around for the patch at the previous puppet swat [16:52:13] Pchelolo: I'll force a puppet run on scb [16:52:16] thank you godog [16:52:24] passes on phabricator now :) [16:52:34] great, thank you :) [16:53:09] (03Abandoned) 10Paladox: Gerrit: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347189 (owner: 10Paladox) [16:53:13] (03Abandoned) 10Paladox: Phabricator: Make backup optional [puppet] - 10https://gerrit.wikimedia.org/r/347188 (owner: 10Paladox) [16:53:13] paladox: https://gerrit.wikimedia.org/r/#/c/347518/ worked fine. thanks! 
[16:53:24] ircecho on systemd, lol [16:53:30] You're welcome :) [16:53:39] I did not expect that to happen tbh [16:53:39] akosiaris i've been testing icinga 2.x [16:53:41] :-) [16:53:59] irc notifications did not work for me with just init [16:54:06] PROBLEM - Host prometheus1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:06] PROBLEM - Host prometheus1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:14] but worked with systemd [16:54:15] mutante gave me the idea to do it :) [16:54:23] weird [16:54:30] <_joe_> wat? [16:54:38] (03PS1) 10Mobrovac: RESTBase: Use the provided logging name and statsd alternative prefixes [puppet] - 10https://gerrit.wikimedia.org/r/347639 (https://phabricator.wikimedia.org/T116335) [16:54:42] we lost prometheus? [16:54:51] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: donatewiki back to wmf.19. you put your left foot in, you put your left foot out... [16:54:55] <_joe_> volans: I don't think so [16:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:12] cannot ping from sarin [16:55:22] <_joe_> volans: are those the active prometheus hosts? [16:55:25] <_joe_> I think they're not [16:55:44] node /^prometheus100[1234]\.eqiad\.wmnet$/ { [16:55:54] <_joe_> yeah [16:55:58] <_joe_> so wtf happened? [16:56:05] <_joe_> are those still on ganeti? [16:56:07] not sure if only the last 2 are the official ones [16:56:22] I just saw their icinga config being generated on einsteinium [16:56:27] <_joe_> akosiaris: you know anything about prometheus1001/1002? 
[16:56:29] <_joe_> oh [16:56:35] !log Deploy unscheduled alter table on db1078 (s3, image table) - T160415 [16:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [16:56:43] _joe_: 3-4 are physical [16:56:46] 1-2 are ganeti [16:56:50] <_joe_> godog: any idea about that? [16:56:54] not sure if we already switched to the physical ones [16:56:56] can check [16:56:58] I have just decommissioned those, and puppet node deactivate / clean [16:57:10] not sure why icinga still thinks they should be there [16:57:21] godog: did you disable puppet on them? [16:57:28] <_joe_> that ^^ [16:57:36] or shut them down [16:57:41] godog: cannot find the SAL of the depool either :D [16:57:46] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:58:00] a node that contacts a puppetmaster gets de-deactivated (pun intended) [16:58:01] (03Draft1) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:58:05] (03PS2) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:58:09] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/6121/" [puppet] - 10https://gerrit.wikimedia.org/r/347639 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:58:21] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10nirajan_pant) >>! In T161529#3170855, @Janak_bhatta wrote: > Project name: विकिपिडिया > Project namespace: विकिपिडिया > Project talk namespace: विकिपिडिया_कुरडि Even if... 
[16:58:24] !log Deploy unscheduled alter table on db1077 (s3, image table) - T160415 [16:58:28] akosiaris: I didn't disable puppet but did gnt-instance delete, so perhaps puppet ran between that [16:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:41] (03PS3) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:58:48] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Use the provided logging name and statsd alternative prefixes [puppet] - 10https://gerrit.wikimedia.org/r/347639 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:58:50] still, puppet node clean has unaccepted the cert so it shuoldn't have been able to run (puppet) [16:58:52] (03PS2) 10Alexandros Kosiaris: RESTBase: Use the provided logging name and statsd alternative prefixes [puppet] - 10https://gerrit.wikimedia.org/r/347639 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:58:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Use the provided logging name and statsd alternative prefixes [puppet] - 10https://gerrit.wikimedia.org/r/347639 (https://phabricator.wikimedia.org/T116335) (owner: 10Mobrovac) [16:58:57] (03PS4) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:59:00] volans: yeah depool happened a while ago, not now [16:59:08] a while ago == two weeks ago [16:59:22] (03PS1) 10Chad: Revert "donatewiki back to wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347641 [16:59:25] that explains it :) [16:59:26] (03CR) 10Chad: [C: 032] Revert "donatewiki back to wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347641 (owner: 10Chad) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1700). Please do the needful. 
[17:00:10] !log Deploy unscheduled alter table on db1035 (s3, image table) - T160415 [17:00:12] Nothing for ORES [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:26] (03Merged) 10jenkins-bot: Revert "donatewiki back to wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347641 (owner: 10Chad) [17:00:36] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:00:37] (03CR) 10jenkins-bot: Revert "donatewiki back to wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347641 (owner: 10Chad) [17:01:06] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:01:08] akosiaris: so node deactivate again would fix it in practice? [17:01:31] godog: yes [17:01:37] (03PS13) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [17:01:40] !log ppchelko@tin Started deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001, attempt#2 T160764 [17:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:47] T160764: Update electron render service - https://phabricator.wikimedia.org/T160764 [17:01:59] !log mobrovac@tin Started deploy [restbase/deploy@e470b9f]: Dev Cluster: Initial Scap3 config deploy, take 2 - T116335 [17:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:06] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [17:02:08] (03PS1) 10Ottomata: Include ores::base on hadoop clients and workers [puppet] - 10https://gerrit.wikimedia.org/r/347645 (https://phabricator.wikimedia.org/T162706) [17:02:25] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: nope, no wmf.19 for donatewiki. 
life is hard [17:02:25] !log Deploy unscheduled alter table on db1038 (s3, image table) - T160415 [17:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:36] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [17:02:39] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3172311 (10Papaul) [17:02:57] !log mobrovac@tin Finished deploy [restbase/deploy@e470b9f]: Dev Cluster: Initial Scap3 config deploy, take 2 - T116335 (duration: 00m 58s) [17:02:59] RainbowSprinkles: still broen? :( [17:03:01] (03PS2) 10Ottomata: Include ores::base on hadoop clients and workers [puppet] - 10https://gerrit.wikimedia.org/r/347645 (https://phabricator.wikimedia.org/T162706) [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:04] broken* [17:03:07] legoktm: Yes [17:03:27] akosiaris: kk, thanks! 
that did it indeed [17:03:40] (03PS3) 10Ottomata: Include ores::base on hadoop clients and workers [puppet] - 10https://gerrit.wikimedia.org/r/347645 (https://phabricator.wikimedia.org/T162706) [17:03:50] legoktm: See scrollback in #-fundraising [17:04:08] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused [17:04:50] !log Deploy unscheduled alter table on db1015 (s3, image table) - T160415 [17:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:00] !log ppchelko@tin Finished deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001, attempt#2 T160764 (duration: 03m 22s) [17:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:08] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.082 second response time [17:05:08] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3172320 (10Papaul) a:05Papaul>03Jgreen @Jgreen Complete let me know if you have any questions. [17:05:30] (03PS1) 10Filippo Giunchedi: Decommission prometheus[12]00[12] [dns] - 10https://gerrit.wikimedia.org/r/347647 (https://phabricator.wikimedia.org/T162712) [17:05:32] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10debt) this has been deployed and will need to wait until after the next deployment freeze to flip the switch. 
[17:05:37] !log mobrovac@tin Started deploy [restbase/deploy@e470b9f]: Staging: Initial Scap3 config deploy, take 2 - T116335 [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:07] !log Deploy unscheduled alter table on db1044 (s3, image table) - T160415 [17:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:28] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 588217 [17:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:38] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) wipe and warmup caches [17:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:58] (03CR) 10Ottomata: [C: 032] Include ores::base on hadoop clients and workers [puppet] - 10https://gerrit.wikimedia.org/r/347645 (https://phabricator.wikimedia.org/T162706) (owner: 10Ottomata) [17:07:17] 06Operations, 10ops-codfw, 06Performance-Team, 15User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3172328 (10Papaul) a:05Papaul>03RobH @robh the flash drive has been removed from graphite2001 [17:07:27] 06Operations, 05Goal, 07kubernetes: Eliminate SPOFs in the existing eqiad infrastructure - https://phabricator.wikimedia.org/T162040#3150598 (10RobH) Task T161702 for the purchase of ganeti nodes in eqiad is being processed by me in the #procurement space & is projected to result in the ordering of 4 new gan... 
[17:07:50] !log mobrovac@tin Finished deploy [restbase/deploy@e470b9f]: Staging: Initial Scap3 config deploy, take 2 - T116335 (duration: 02m 12s) [17:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:57] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [17:08:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:47] !log restbase enabling back puppet for T116335 [17:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:43] (03PS14) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [17:12:00] (03PS2) 10Filippo Giunchedi: Decommission prometheus[12]00[12] [dns] - 10https://gerrit.wikimedia.org/r/347647 (https://phabricator.wikimedia.org/T162712) [17:12:34] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3172364 (10nirajan_pant) **About the language name:-** We have created some discussions on the name for this language. The discussions came to result which prefer use of 'Dotyali'... [17:12:57] 06Operations, 10Electron-PDFs, 06Services, 13Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3172367 (10Pchelolo) Although the patch was merged, the situation didn't change - the exact same log is produced on server restart... 
[17:14:10] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t04_cache_wipe(eqiad, codfw) Successfully completed [17:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:58] (03CR) 10Filippo Giunchedi: [C: 032] Decommission prometheus[12]00[12] [dns] - 10https://gerrit.wikimedia.org/r/347647 (https://phabricator.wikimedia.org/T162712) (owner: 10Filippo Giunchedi) [17:15:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:16:43] (03PS5) 10Alexandros Kosiaris: changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) [17:17:00] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] changeprop: Add an ores_uris parameter [puppet] - 10https://gerrit.wikimedia.org/r/345826 (https://phabricator.wikimedia.org/T159615) (owner: 10Alexandros Kosiaris) [17:18:37] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3172430 (10Papaul) The server is plugged in port 8 on pfw1-codfw [17:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:19:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347651 [17:19:31] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347651 [17:20:52] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347651 (owner: 10Marostegui) [17:21:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347651 (owner: 10Marostegui) [17:22:03] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347651 (owner: 10Marostegui) [17:23:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 (duration: 00m 57s) [17:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:52] (03PS3) 10Smalyshev: [cirrus] Increase max field count for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346542 (owner: 10DCausse) [17:25:38] PROBLEM - DPKG on thorium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:25:45] 06Operations, 06Labs: Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#3172457 (10chasemp) If performance allows it would be great to get `RAID 50` esp since this is a 2 node HA cluster. We could finally do the beginnings of real (but limited) user backups. [17:26:38] RECOVERY - DPKG on thorium is OK: All packages OK [17:26:43] (03CR) 10Thcipriani: Scap clean: exclude .git directories on first pass (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347633 (owner: 10Chad) [17:26:45] (03CR) 10Ema: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/347378 (owner: 10Muehlenhoff) [17:27:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:29:04] (03PS15) 10Gehel: [WIP] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [17:30:39] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:30:52] !log mobrovac@tin Started deploy [restbase/deploy@e470b9f]: Initial Scap3 config deploy - T116335 [17:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:00] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [17:31:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:35:33] (03PS16) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [17:36:39] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:38:08] (03CR) 10Gehel: "This is starting to look reasonable. Puppet compiler seems happy (https://puppet-compiler.wmflabs.org/6125/). There are some minor differe" [puppet] - 10https://gerrit.wikimedia.org/r/347006 (owner: 10Gehel) [17:38:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:41:32] !log mobrovac@tin Finished deploy [restbase/deploy@e470b9f]: Initial Scap3 config deploy - T116335 (duration: 10m 39s) [17:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:39] T116335: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335 [17:44:58] PROBLEM - mediawiki-installation DSH group on mw2246 is CRITICAL: Host mw2246 is not in mediawiki-installation dsh group [17:45:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:46:45] (03PS1) 10Chad: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347652 [17:47:10] (03CR) 10Chad: [C: 04-2] "For later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347652 (owner: 10Chad) [17:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:09] !log demon@tin Pruned MediaWiki: 1.29.0-wmf.17 [keeping static files] (duration: 00m 16s) [17:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:42] !log demon@tin Started scap: testwiki to wmf.20 to bootstrap [17:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:05] 06Operations, 10RESTBase, 13Patch-For-Review, 10Scap (Scap3-Adoption-Phase1), and 2 others: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#3172576 (10mobrovac) [17:52:48] 06Operations, 10RESTBase, 10Scap (Scap3-Adoption-Phase1), 06Services (done), 15User-mobrovac: Deploy RESTBase with scap3 - https://phabricator.wikimedia.org/T116335#1746874 (10mobrovac) 05Open>03Resolved The switch has been fully completed! 
[17:55:32] (03Abandoned) 10Mobrovac: RESTBase: Send the logs locally to stdout/syslog [puppet] - 10https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: 10Mobrovac) [17:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:02:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:03:53] someone playing on mira? [18:04:04] "Improperly owned -0:0- files in /srv/mediawiki-staging" [18:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:07:03] jynus: I haven't messed with mira, but I'm mid-scap right now? [18:07:11] Bad timing of check + rsync? [18:07:19] (temporary root ownership until it finished?) [18:08:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:08:42] could be, it is gone now [18:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:13:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:15:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:18:49] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:19:28] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:32] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3163952 (10jcrespo) I am commenting this here, please tell me if completely unrelated and I will create a new ticket: db1090 keeps failing to run puppet according to icinga since Apr... [18:25:09] !log demon@tin Finished scap: testwiki to wmf.20 to bootstrap (duration: 35m 27s) [18:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:28] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 161 [18:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:33:53] elukey@mw1165:~$ hhvmadm check-health [18:33:54] { "load":128 [18:33:54] , "queued":111 [18:34:54] !log restart hhvm on mw1165 (debug in /tmp/hhvm.5384.bt.) [18:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:18] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [18:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:37:45] * elukey is looking forward for hhvm 3.18 [18:38:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:45:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:23] (03PS2) 10ArielGlenn: add a sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626 [18:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T1900). Please do the needful. [19:00:11] choo choo [19:00:33] !log ppchelko@tin Started deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001, attempt#3 T160764 [19:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:40] T160764: Update electron render service - https://phabricator.wikimedia.org/T160764 [19:01:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:01:22] !log ppchelko@tin Finished deploy [electron-render/deploy@5492cdb]: Update to latest upstream, canary on scb2001, attempt#3 T160764 (duration: 00m 52s) [19:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (03CR) 10Chad: [C: 032] Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347652 (owner: 10Chad) [19:03:15] (03Merged) 10jenkins-bot: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347652 (owner: 10Chad) [19:05:18] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: 
Transfers/Sec=450.40 Read Requests/Sec=0.00 Write Requests/Sec=510.20 KBytes Read/Sec=0.00 KBytes_Written/Sec=12678.80 [19:06:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=25.20 Read Requests/Sec=0.30 Write Requests/Sec=28.70 KBytes Read/Sec=1.20 KBytes_Written/Sec=321.20 [19:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:08:21] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.20 [19:08:22] !log ppchelko@tin Started deploy [electron-render/deploy@5492cdb]: Update to latest upstream, full deploy, T160764 [19:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:35] T160764: Update electron render service - https://phabricator.wikimedia.org/T160764 [19:08:37] (03CR) 10jenkins-bot: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347652 (owner: 10Chad) [19:08:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:10:20] (03CR) 10jerkins-bot: [V: 04-1] add a sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626 (owner: 10ArielGlenn) [19:11:58] !log ppchelko@tin Finished deploy [electron-render/deploy@5492cdb]: Update to latest upstream, full deploy, T160764 (duration: 03m 38s) [19:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:38] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [19:15:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:18:38] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [19:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:18:58] !log ppchelko@tin Started deploy [electron-render/deploy@5492cdb]: Update to latest upstream, full deploy, attempt#2 T160764 [19:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:06] T160764: Update electron render service - https://phabricator.wikimedia.org/T160764 [19:20:21] !log ppchelko@tin Finished deploy [electron-render/deploy@5492cdb]: Update to latest upstream, full deploy, attempt#2 T160764 (duration: 01m 25s) [19:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:20] 06Operations, 10Cassandra, 06Services (blocked): Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3172972 (10Eevans) [19:25:36] 06Operations, 10Cassandra, 06Services (blocked): Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3173008 (10Eevans) @MoritzMuehlenhoff I know you are planning to bounce 
these machines as part of a kernel upgrade, any chance you could... [19:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:31:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:38:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:45:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:01:39] (03PS4) 10Andrew Bogott: slapd conf: Allow for unlimited paged searches [puppet] - 10https://gerrit.wikimedia.org/r/346790 [20:02:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:03:18] (03CR) 10Andrew Bogott: [C: 032] slapd conf: Allow for unlimited paged searches [puppet] - 10https://gerrit.wikimedia.org/r/346790 (owner: 10Andrew Bogott) [20:04:14] (03PS1) 10Mobrovac: RESTBase: Clean-up unused variables [puppet] - 10https://gerrit.wikimedia.org/r/347676 [20:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:07:07] (03PS1) 10Andrew Bogott: Revert "slapd conf: Allow for unlimited paged searches" [puppet] - 10https://gerrit.wikimedia.org/r/347677 [20:07:26] (03CR) 10Andrew Bogott: [C: 032] Revert "slapd conf: Allow for unlimited paged searches" [puppet] - 10https://gerrit.wikimedia.org/r/347677 (owner: 10Andrew Bogott) [20:07:30] (03CR) 10Andrew Bogott: [V: 032 C: 032] Revert "slapd conf: Allow for unlimited paged searches" [puppet] - 10https://gerrit.wikimedia.org/r/347677 (owner: 10Andrew Bogott) [20:07:44] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3173215 (10Dzahn) > Does this change in name will create any issue? Yes, kind of. The name has already been used in DNS commit, in Wikidata and in the blog. I really assumed the... [20:07:58] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:08:03] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.017 second response time [20:08:33] That ldap failure is my fault and should resolve in just a second [20:08:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:08:43] ok [20:08:58] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational [20:09:03] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.363 second response time [20:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:19] hm, got paged for ldap and seaborgium seems to have had an issue [20:09:30] ah andrewbogott thanks [20:09:38] PROBLEM - Check systemd state on pollux is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:09:53] PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: Could not bind to the LDAP server [20:10:32] * apergos peeks in, does the backread, relaxes [20:11:09] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347679 (https://phabricator.wikimedia.org/T128546) [20:11:41] http://s2.quickmeme.com/img/70/70710c8aff156b84becbd522bea259023fb64be7369ec8c3bc9638f348284ab1.jpg [20:13:38] RECOVERY - Check systemd state on pollux is OK: OK - running: The system is fully operational [20:13:53] RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.110 seconds response time [20:14:29] why did oit one has issues, too? 
[20:14:54] oh, module patch [20:14:56] I see [20:15:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:17:59] 06Operations: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3173245 (10Andrew) [20:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:19:56] namespace configuration is in commonsettings correct? [20:20:21] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3173260 (10Andrew) Fixing the puppetmaster issue requires changing (well, removing) the pinning in the puppet manifest, right? Is there a reason not to do that right away? [20:20:56] disregard [20:21:46] corp mirror and labs ldap use the same common openldap class [20:23:12] 06Operations, 10Cassandra, 06Services (blocked): Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3173263 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [20:24:23] (03CR) 10Mobrovac: [C: 031] "PCC looking good - https://puppet-compiler.wmflabs.org/6126/ . The differences that appear in config-vars.yaml due to this change are cove" [puppet] - 10https://gerrit.wikimedia.org/r/347676 (owner: 10Mobrovac) [20:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:29:53] 06Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3173272 (10RobH) 05Open>03stalled a:03RobH I have sub tasks filed for the actual quotation, so I'm moving this to pending impelementation, assigned to me, and stalled. 
[20:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:32:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:35:09] (03PS1) 10Andrew Bogott: nova-compute monitoring: Check for one /and only one/ nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) [20:36:09] (03PS2) 10Andrew Bogott: nova-compute monitoring: Check for one and only one nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) [20:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:38:07] 06Operations: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3173294 (10Andrew) a:05Andrew>03MoritzMuehlenhoff [20:38:38] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:43:26] 06Operations, 10hardware-requests: eqiad: (4) worker servers for kubernetes - https://phabricator.wikimedia.org/T141624#3173314 (10RobH) [20:44:38] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:46:08] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:51:37] <_joe_> !log killed running 'puppet agent t-v' on ruthenium [20:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:02:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:02:15] 06Operations, 06Labs, 10procurement: eqiad: (2) hardware access request for labvirt1019 and labvirt1020 (refresh) - https://phabricator.wikimedia.org/T162486#3173351 (10RobH) [21:04:46] (03CR) 10Mobrovac: [C: 031] "Tested in BetaCluster as well, todo bueno" [puppet] - 10https://gerrit.wikimedia.org/r/347676 (owner: 10Mobrovac) [21:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:07:02] (03PS12) 10Dzahn: lists: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/346923 [21:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:11:03] !log mobrovac@tin Started deploy [restbase/deploy@a4042a6]: Dev cluster: Update the legal text in the API docs [21:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:12] (03PS5) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [21:12:19] (03PS6) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [21:12:40] !log mobrovac@tin Finished deploy [restbase/deploy@a4042a6]: Dev cluster: Update the legal text in the API docs (duration: 01m 37s) [21:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:56] !log mobrovac@tin Started deploy [restbase/deploy@a4042a6]: Staging: Update the legal text in the API docs [21:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:12] (03CR) 10Dzahn: [C: 032] "thanks for reviews, also ok in compiler: http://puppet-compiler.wmflabs.org/6127/fermium.wikimedia.org/" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [21:14:08] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:14:31] (03CR) 10Dzahn: "only change on fermium was the motd" [puppet] - 10https://gerrit.wikimedia.org/r/346923 (owner: 10Dzahn) [21:16:51] !log mobrovac@tin Finished deploy [restbase/deploy@a4042a6]: Staging: Update the legal text in the API docs (duration: 03m 55s) [21:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:08] !log mobrovac@tin Started deploy [restbase/deploy@a4042a6]: Update the legal text in the API docs [21:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:49] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:21:07] (03PS2) 10Dzahn: DNS: Add mgmt and production DNS for frbackup2001 [dns] - 10https://gerrit.wikimedia.org/r/347638 (owner: 10Papaul) [21:22:46] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS for frbackup2001 [dns] - 10https://gerrit.wikimedia.org/r/347638 (owner: 10Papaul) [21:23:57] !log mobrovac@tin Finished deploy [restbase/deploy@a4042a6]: Update the legal text in the API docs (duration: 06m 49s) [21:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:31:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [21:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:40:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:40:55] where do i find the "Collection extension configuration". Not in mediawiki-config repo? [21:41:15] the extension itself is on github apparently [21:41:19] where is the config for it [21:41:47] "remove the host from the round-robin DNS name specified in the Collection extension configuration" [21:43:03] https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php ? 
[21:45:32] if ( $wmgUseCollection ) { [21:45:36] https://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php [21:46:58] $wgCollectionMWServeURL = $wmfLocalServices['ocg']; [21:47:07] 'ocg' => 'http://ocg.svc.eqiad.wmnet:8000', [21:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:51:20] Reedy: yea, i see that part, also: [21:51:20] wmf-config/ProductionServices.php: 'ocg' => 'http://ocg.svc.eqiad.wmnet:8000', [21:51:37] but what i need is to remove one host from that service name [21:52:04] like "ocg1001 is part of ocg.svc" [21:52:56] i'm trying to follow https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host [21:53:06] to take down one host of that pool [21:54:30] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3173455 (10Papaul) [21:56:05] if all it is is depooling it with confctl then i dont get why it says " remove the host from the round-robin DNS name specified in the Collection extension configuration" [21:56:37] but i guess that's what it means here and it's just the wording [21:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:01:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:04:13] RainbowSprinkles: is the train all done and okay? :) [22:05:08] (03PS2) 10Addshore: Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370) [22:05:26] (03PS2) 10Addshore: Enable alternate RevSlider slider on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410) [22:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:09:44] addshore: Been done for hours :) [22:10:10] RainbowSprinkles: any objection to me sneaking my 2 patches above out before the next swat so I can go to bed? ;) [22:12:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [22:12:50] Ummm [22:12:52] jouncebot: now [22:12:52] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [22:13:37] addshore: I suppose yeah [22:13:46] (03CR) 10Addshore: [C: 032] Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370) (owner: 10Addshore) [22:13:49] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3005742 keys, up 19 days 5 hours - replication_delay is 51 [22:14:41] "ping on port 6479" sounds a bit off [22:14:56] (03Merged) 10jenkins-bot: Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370) (owner: 10Addshore) [22:16:09] i wonder why ocg1003 doesn't do anything, zero CPU usage [22:16:21] while ocg1001/1002 are used [22:16:29] they are all pooled [22:17:15] (03CR) 10jenkins-bot: Enable TwoColConflict BetaFeature on fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347582 (https://phabricator.wikimedia.org/T162370) (owner: 10Addshore) [22:17:44] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:347582|Enable TwoColConflict BetaFeature on fiwiki]] (duration: 00m 46s) [22:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:24] (03CR) 10Addshore: [C: 032] Enable alternate RevSlider slider on 
group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410) (owner: 10Addshore) [22:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:19:00] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=ocg1001.eqiad.wmnet [22:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:39] (03Merged) 10jenkins-bot: Enable alternate RevSlider slider on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410) (owner: 10Addshore) [22:19:48] (03CR) 10jenkins-bot: Enable alternate RevSlider slider on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347583 (https://phabricator.wikimedia.org/T160410) (owner: 10Addshore) [22:20:40] aha, because it's disabled in Hiera.. but why [22:21:03] guess i'll try turning it on while 1001 needs repairs [22:22:46] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:347583|Enable alternate RevSlider slider on group0 T160410]] (duration: 00m 45s) [22:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:52] T160410: Make revision slider pointers more user-friendly - https://phabricator.wikimedia.org/T160410 [22:23:41] {{done}} [22:23:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3005742 keys, up 19 days 6 hours - replication_delay is 651 [22:25:42] (03PS1) 10Dzahn: ocg: enable ocg1003, disable ocg1001 [puppet] - 10https://gerrit.wikimedia.org/r/347781 (https://phabricator.wikimedia.org/T84723) [22:27:31] (03PS2) 10Dzahn: ocg: enable ocg1003, disable ocg1001 [puppet] - 10https://gerrit.wikimedia.org/r/347781 (https://phabricator.wikimedia.org/T84723) [22:27:48] RECOVERY - puppet last run on 
tempdb2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:31:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:32:38] (03CR) 10Dzahn: [C: 032] ocg: enable ocg1003, disable ocg1001 [puppet] - 10https://gerrit.wikimedia.org/r/347781 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [22:36:34] !log ocg1003 started picking up jobs (mw-ocg-latexer) after it was enabled with gerrit:347781, ocg1001 was disabled in the same change. Also ganglia graphs confirm it. T84723 T161158 [22:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:42] T84723: reinstall OCG servers - https://phabricator.wikimedia.org/T84723 [22:36:42] T161158: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158 [22:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:43:48] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3005045 keys, up 19 days 6 hours - replication_delay is 0 [22:48:27] 06Operations, 10ops-eqiad, 13Patch-For-Review: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3173697 (10Dzahn) {F7495834} [22:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:51:33] (03PS1) 10Smalyshev: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) [22:53:34] @seen cmjohnson [22:53:34] mutante: Last time I saw cmjohnson they were changing the nickname to , but is no longer in channel #wikimedia-operations at 2/3/2017 2:59:03 PM (67d7h54m30s ago) [22:53:49] aha, the real cloak [22:54:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3005086 keys, up 19 days 6 hours - replication_delay is 658 [22:55:37] (03PS2) 10Smalyshev: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) [22:56:42] (03PS3) 10Smalyshev: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) [22:56:48] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3004586 keys, up 19 days 6 hours - replication_delay is 48 [22:57:04] elukey: i see you reinstalled mw2246 - icinga says "Host mw2246 is not in mediawiki-installation dsh group" should we add it? 
[22:57:16] (03CR) 10EBernhardson: [C: 031] Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [22:57:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170411T2300). Please do the needful. [23:00:05] RoanKattouw, kaldari, and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:34] o/ [23:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:01:15] jouncebot ignores me :\ [23:01:48] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [23:02:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:02:49] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3004522 keys, up 19 days 6 hours - replication_delay is 0 [23:03:33] I can SWAT [23:03:48] yay [23:03:58] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3173711 (10RobH) [23:04:00] !log ocg1001 - apt-get clean for disk space [23:04:01] (03PS2) 10Thcipriani: Enable Flow beta feature on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022) (owner: 10Catrope) [23:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:04:15] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022) (owner: 10Catrope) [23:06:37] (03Merged) 10jenkins-bot: Enable Flow beta feature on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022) (owner: 10Catrope) [23:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:06:46] here [23:06:59] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3157106 (10RobH) I've included the two hosts for this on procurement task T161724, since its identical to the ores hosts as stated in the request. [23:07:07] 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3173740 (10RobH) [23:07:09] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3173738 (10RobH) 05Open>03stalled a:03RobH [23:07:11] (03CR) 10jenkins-bot: Enable Flow beta feature on frwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347419 (https://phabricator.wikimedia.org/T162022) (owner: 10Catrope) [23:07:16] RoanKattouw: you change is live on mwdebug1002, check please [23:07:22] *your [23:07:57] (03PS5) 10Thcipriani: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [23:08:58] thcipriani: Works great [23:09:04] RoanKattouw: ok, going live [23:09:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:09:43] thcipriani: nothing to test for mine. 
[23:10:13] since nothing uses it yet [23:10:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [23:10:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347419|Enable Flow beta feature on frwikiversity]] T162022 (duration: 00m 46s) [23:10:46] kaldari: ok, will sync once it merges [23:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:52] T162022: Activate Flow as a Beta feature on French Wikiversity - https://phabricator.wikimedia.org/T162022 [23:10:54] ^ RoanKattouw live everywhere [23:11:23] !log ocg1001 - scheduled downtime in icinga for host and all services, confirmed it's not actively doign things anymore, shutting down for hardware replacement (T161158) [23:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:29] T161158: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158 [23:12:09] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3173752 (10Dzahn) [23:12:13] (03Merged) 10jenkins-bot: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [23:12:27] (03CR) 10jenkins-bot: Adding pageassessments.dblist for maintanence script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347468 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [23:13:32] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3123279 (10Dzahn) a:05Dzahn>03Cmjohnson See above, i have deactivated and shut down this host. Please replace the disk / fix the RAID and let me know when done so we can activate it again. You can do this anytim... 
[23:14:20] !log thcipriani@tin Synchronized dblists/pageassessments.dblist: SWAT: [[gerrit:347468|Adding pageassessments.dblist for maintanence script]] T159438 PART I (duration: 00m 45s) [23:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:27] T159438: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438 [23:15:42] !log thcipriani@tin Synchronized docroot/noc/conf/pageassessments.dblist: SWAT: [[gerrit:347468|Adding pageassessments.dblist for maintanence script]] T159438 PART II (duration: 00m 45s) [23:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:49] ^ kaldari all sync'd [23:17:34] MaxSem: ping for SWAT [23:17:40] pong [23:17:51] jan_drewniak, ^^^ [23:18:14] (03PS2) 10Thcipriani: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347679 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:18:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347679 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:20:25] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347679 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:20:34] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347679 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:21:41] MaxSem: jan_drewniak portals update is on mwdebug1002, check please [23:23:00] !log ocg: clearing host cache for ocg1001 which is shutdown for hardware repair. 
(on ocg1003: sudo -u ocg -g ocg nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1001) T161158 [23:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:07] T161158: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158 [23:23:13] stats updated, the page seems to function fine [23:24:02] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3173767 (10Dzahn) To answer my own questions from above: >Once the DNS change has propagated and you've restarted OCG with the decommission configuration (restarting will wait for any existing jobs on that host to... [23:24:32] ok, running sync-portals script [23:25:14] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3173768 (10Dzahn) @cmjohnson See above, i have deactivated and shut down this host. Please replace the disk / fix the RAID and let me know when done so we can activate it again. You can do this anytime without furth... 
[23:26:53] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:347679|Bumping portals to master]] T128546 (duration: 00m 46s) [23:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:00] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [23:27:40] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:347679|Bumping portals to master]] T128546 (duration: 00m 46s) [23:27:43] debt, ^ [23:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [23:27:59] MaxSem: sync complete [23:28:34] thcipriani, thanks - looks good [23:28:56] cool, thanks for checking :) [23:29:29] MaxSem, thcipriani portals look good, thanks! [23:29:47] (03CR) 10Dzahn: "@Giuseppe: Ok with you? Then i would also change it in the "working example" code on the wikitech puppet coding page." 
[puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [23:29:53] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:31:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [23:31:59] thcipriani: thanks [23:32:02] (03PS4) 10Thcipriani: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:32:18] (03CR) 10Thcipriani: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:32:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:33:12] SMalyshev: np :) (just have to fight with gerrit about rebasing stuff of course) [23:33:13] (03CR) 10Dzahn: [C: 031] "addressed joe's comment and compiled: http://puppet-compiler.wmflabs.org/6128/kraz.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn) [23:33:28] (03PS3) 10Dzahn: mw_rc_irc: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346926 [23:35:52] (03Merged) 10jenkins-bot: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:36:38] SMalyshev: patch is live on mwdebug1002, check please [23:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:37:02] thcipriani: not sure I can check it on mwdebug, since it's now only for command-line [23:37:12] so I can check it on terbium for example [23:37:15] (03CR) 10jenkins-bot: Enable deleted archive indexing & searching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347782 (https://phabricator.wikimedia.org/T109561) (owner: 10Smalyshev) [23:38:11] SMalyshev: okie doke, I fetched it over to terbium if there's anything you want to check there [23:38:22] (03CR) 10Dzahn: [C: 032] mw_rc_irc: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn) [23:39:08] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:39:42] (03CR) 10Dzahn: "nothing happened on "kraz" at all. total no-op" [puppet] - 10https://gerrit.wikimedia.org/r/346926 (owner: 10Dzahn) [23:48:49] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:49:10] thcipriani: checking [23:49:32] ok [23:54:06] thcipriani: seems to be working fine [23:54:31] SMalyshev: awesome, thanks for checking, I'll sync everywhere now [23:56:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347782|Enable deleted archive indexing & searching]] T109561 PART I (duration: 00m 45s) [23:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:53] T109561: Add non-exact title search to Special:Undelete and corresponding API - https://phabricator.wikimedia.org/T109561 [23:58:08] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:347782|Enable deleted archive indexing & searching]] T109561 PART II (duration: 00m 45s) [23:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:16] ^ SMalyshev live everywhere [23:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures