[00:01:58] (03CR) 10Alex Monk: [C: 04-1] Rename two namespaces at bswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [00:04:15] (03CR) 10Alex Monk: [C: 04-1] "The other wikis seem to have index/appendix the other way around?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) (owner: 10Luke081515) [00:08:55] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [00:09:16] (03CR) 10Alex Monk: [C: 032] Modifying logo for anwiki per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841) (owner: 10MarcoAurelio) [00:09:42] (03Merged) 10jenkins-bot: Modifying logo for anwiki per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841) (owner: 10MarcoAurelio) [00:10:30] !log krenair@tin Synchronized w/static/images/project-logos/anwiki.png: https://gerrit.wikimedia.org/r/#/c/247253/ (duration: 00m 16s) [00:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:28] (03CR) 10Legoktm: "Why is this not setting $wgDefaultUserOptions in InitialiseSettings.php itself?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [00:24:46] (03PS11) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [00:37:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [01:15:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:19:07] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1747267 (10Milimetric) @Dzahn, I'm not sure if you're familiar with the deployment tool that RESTBase is using, the command would be this: ansible-playbook -i production -... 
[01:38:55] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [01:46:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:07:44] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [02:16:36] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:36:47] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 04s) [02:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:06] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [02:41:25] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-23 02:41:25+00:00 [02:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:56] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:58:55] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [03:00:46] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [03:08:36] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [03:15:26] !log aaron@tin Synchronized php-1.27.0-wmf.3/includes/profiler/TransactionProfiler.php: 5ef4a91480ea (duration: 00m 18s) [03:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:15:45] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [03:34:31] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1747389 (10bd808) [03:35:14] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [03:36:46] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1747392 (10bd808) >>! In T87036#1640693, @hashar wrote: > Following on @Dzahn comment, should probably use Jessie instead of Trusty. I don't think that... [03:36:55] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [03:37:26] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [03:46:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:08:14] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [04:15:16] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [04:25:13] 6operations, 7Icinga, 7Monitoring: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) - https://phabricator.wikimedia.org/T116099#1747412 (10akosiaris) 5Open>3Resolved a:3akosiaris https://gerrit.wikimedia.org/r/#/c/248036/ fixed it. 
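The "nutcracker port on silver" flaps above come from a plain TCP connect check with a short timeout: the monitor opens a socket to port 11212 and reports CRITICAL if the connection does not complete within two seconds. A minimal sketch of such a probe follows; the host and port are taken from the log, and the script itself is only an illustration, not the actual check_tcp plugin.

#!/usr/bin/env python3
# Minimal TCP connect probe illustrating the "nutcracker port" check seen above:
# open a socket with a short timeout and report OK/CRITICAL in Nagios style.
# Host and port come from the log; the code is an illustrative sketch, not the
# real check_tcp plugin.
import socket
import sys
import time

HOST = 'silver.wikimedia.org'   # assumption: address the check actually targets
PORT = 11212                    # nutcracker port named in the recovery message
TIMEOUT = 2.0                   # the check above gives up after 2 seconds

def probe(host, port, timeout):
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.time() - start
    except (socket.timeout, OSError):
        return False, time.time() - start

if __name__ == '__main__':
    ok, elapsed = probe(HOST, PORT, TIMEOUT)
    if ok:
        print('TCP OK - %.3f second response time on port %d' % (elapsed, PORT))
        sys.exit(0)
    print('CRITICAL - Socket timeout after %d seconds' % int(TIMEOUT))
    sys.exit(2)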
Resolving [04:37:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [04:39:34] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:41:24] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 9.278 second response time [04:46:56] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:56:04] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:01:27] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.023 second response time [05:08:54] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [05:16:04] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:37:46] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [05:45:04] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [05:53:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 23 05:53:43 UTC 2015 (duration 53m 42s) [05:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:57:04] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [05:58:54] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [06:08:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [06:15:45] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:29:35] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail [06:30:46] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [06:30:54] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [06:31:36] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:35] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:55] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:36] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1747484 (10ori) p:5Normal>3High [06:37:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [06:48:58] (03CR) 10Muehlenhoff: "Looks good to me, but some comments inline." 
(037 comments) [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [06:55:20] !log ori@tin Synchronized php-1.27.0-wmf.2/extensions/AbuseFilter/AbuseFilterTokenizer.php: I65d4c6064: Track tokenizer cache hits / misses (duration: 00m 17s) [06:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:55:39] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/AbuseFilter/AbuseFilterTokenizer.php: I65d4c6064: Track tokenizer cache hits / misses (duration: 00m 18s) [06:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:55] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:04] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:06] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:14] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:15:44] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:25:55] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:27:15] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:28:35] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:55] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:37:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [07:42:29] (03PS4) 10Mobrovac: RESTBase: Set up the AQS public API [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) [07:46:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:04:05] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25) [08:08:16] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [08:11:37] (03CR) 10Alexandros Kosiaris: "I am just gonna +2 and merge this, cause technically it is fine, but I feel obliged to point out that instead of what could be a simple (i" [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [08:11:42] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Set up the AQS public API [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [08:14:49] (03PS2) 10Muehlenhoff: Assign salt grains for mariadb::labs [puppet] - 10https://gerrit.wikimedia.org/r/248038 [08:15:26] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [08:15:57] (03PS2) 10Alexandros Kosiaris: RESTBase: Set up MobileApps storage [puppet] - 10https://gerrit.wikimedia.org/r/248026 (https://phabricator.wikimedia.org/T102130) (owner: 10Mobrovac) [08:16:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] RESTBase: Set up MobileApps storage [puppet] - 10https://gerrit.wikimedia.org/r/248026 
(https://phabricator.wikimedia.org/T102130) (owner: 10Mobrovac) [08:16:44] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 1 failures [08:18:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for mariadb::labs [puppet] - 10https://gerrit.wikimedia.org/r/248038 (owner: 10Muehlenhoff) [08:19:12] (03PS3) 10Muehlenhoff: Assign salt grains for mariadb::labs [puppet] - 10https://gerrit.wikimedia.org/r/248038 [08:19:13] (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for mariadb::labs [puppet] - 10https://gerrit.wikimedia.org/r/248038 (owner: 10Muehlenhoff) [08:19:45] (03PS2) 10Muehlenhoff: Assign salt grains for osm [puppet] - 10https://gerrit.wikimedia.org/r/248039 [08:20:24] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [08:23:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for osm [puppet] - 10https://gerrit.wikimedia.org/r/248039 (owner: 10Muehlenhoff) [08:26:07] (03PS2) 10Muehlenhoff: Assign salt grains for the LVS servers [puppet] - 10https://gerrit.wikimedia.org/r/248040 [08:34:24] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:35:08] ignore ^^ [08:35:15] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [08:35:21] that too ^^ [08:37:05] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.013 second response time [08:37:55] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [08:38:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for the LVS servers [puppet] - 10https://gerrit.wikimedia.org/r/248040 (owner: 10Muehlenhoff) [08:39:05] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [08:42:05] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:46:15] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:58:35] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [09:00:26] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [09:04:04] (03CR) 10Giuseppe Lavagetto: "I think setting TCP_USER_TIMEOUT would be redundant in this specific case, since we already set variables that will give the same practica" [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [09:04:27] (03CR) 10Ori.livneh: "Yes, I think you're right." [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [09:04:37] _joe_: I read about it some more and reached the same conclusion [09:04:59] <_joe_> ori: ok! 
I was fearing I was missing something here [09:05:37] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [09:06:04] (03Merged) 10jenkins-bot: IdleConnection: set keepalive [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [09:08:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [09:13:26] (03CR) 10Filippo Giunchedi: [C: 04-1] The varnish reqstats diamond collector does not work, emit to statsd directly instead (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [09:15:32] (03PS1) 10Muehlenhoff: Assign salt grains for swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/248298 [09:15:32] (03PS1) 10Muehlenhoff: Assign salt grains for swift backends [puppet] - 10https://gerrit.wikimedia.org/r/248299 [09:15:32] (03PS1) 10Muehlenhoff: Assign salt grains for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/248300 [09:15:32] (03PS1) 10Muehlenhoff: Assign salt grains for archiva [puppet] - 10https://gerrit.wikimedia.org/r/248301 [09:15:54] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:17:05] !log restbase rolling-restart after config changes [09:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:06] !log restarting schema change on geo_tags for all wikis [09:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:13] !log reimage restbase-test2002.codfw.wmnet [09:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:40] (03PS1) 10Alexandros Kosiaris: puppetmaster: /var/lib/puppet/ssl should be group puppet [puppet] - 10https://gerrit.wikimedia.org/r/248302 [09:36:40] (03PS1) 10Muehlenhoff: Assign salt grains for pmacct and puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/248303 [09:36:42] (03PS1) 10Muehlenhoff: Assign salt grains for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/248304 [09:36:44] (03PS1) 10Muehlenhoff: Assign salt grains for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/248305 [09:38:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [09:39:19] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2002 instances [puppet] - 10https://gerrit.wikimedia.org/r/248307 [09:41:37] (03PS2) 10Filippo Giunchedi: cassandra: add restbase-test2002 instances [puppet] - 10https://gerrit.wikimedia.org/r/248307 [09:41:39] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2002-[ab] to seeds [puppet] - 10https://gerrit.wikimedia.org/r/248308 [09:42:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2002-[ab] to seeds [puppet] - 10https://gerrit.wikimedia.org/r/248308 (owner: 10Filippo Giunchedi) [09:42:56] there are 5 million of items geotagged in commons, expect 1-5 seconds of lag during schema change on our least powerful servers [09:44:29] (03PS2) 10Muehlenhoff: Assign salt grains for swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/248298 [09:44:40] 1 million rows/s ? 
doesn't seem too shabby [09:45:07] well, I said 5 seconds of lag, the actual operation will take 1-5 minutes [09:45:33] but only becase it is done "inefficiently" to avoid locks and lag [09:45:55] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:45:55] (03CR) 10Alexandros Kosiaris: [C: 031] "9 304k whisper files per host for a total of 2.7MB per host. That amounts to around ~3G on the graphite host. I think this sounds reasonab" [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris) [09:46:19] ah ok that's more like it, ~20k rows/s then [09:46:49] (03CR) 10Filippo Giunchedi: "yup, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris) [09:47:11] we have very different hardware, performance varies widly between our 64G servers and the 160G ones [09:47:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/248298 (owner: 10Muehlenhoff) [09:48:37] even disk operations are like 5 times faster on the newest servers [09:49:00] (03PS2) 10Alexandros Kosiaris: diamond: enable ntpd collector across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/247823 [09:49:51] (03PS2) 10Muehlenhoff: Assign salt grains for swift backends [puppet] - 10https://gerrit.wikimedia.org/r/248299 [09:50:29] just because of more memory / newer controllers ? [09:50:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for swift backends [puppet] - 10https://gerrit.wikimedia.org/r/248299 (owner: 10Muehlenhoff) [09:51:59] (03PS2) 10Muehlenhoff: Assign salt grains for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/248300 [09:52:14] (03CR) 10Alexandros Kosiaris: [C: 032] diamond: enable ntpd collector across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris) [09:52:21] (03PS3) 10Alexandros Kosiaris: diamond: enable ntpd collector across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/247823 [09:52:26] (03CR) 10Alexandros Kosiaris: [V: 032] diamond: enable ntpd collector across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris) [09:53:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/248300 (owner: 10Muehlenhoff) [09:53:08] (03PS3) 10Muehlenhoff: Assign salt grains for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/248300 [09:53:08] (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/248300 (owner: 10Muehlenhoff) [09:53:37] (03PS2) 10Muehlenhoff: Assign salt grains for archiva [puppet] - 10https://gerrit.wikimedia.org/r/248301 [09:54:14] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [09:54:21] db1042 may alert because of this, we do not cate about that [09:54:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for archiva [puppet] - 10https://gerrit.wikimedia.org/r/248301 (owner: 10Muehlenhoff) [09:55:32] (03PS2) 10Muehlenhoff: Assign salt grains for pmacct and puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/248303 [10:16:04] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:16:31] I created some deadlocks on itwiki and commons when applying the changes with geo-related-jobs- which it is semi-expected, and the logic of the jobs should be so that they are 
retried [10:20:34] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:21:20] !log End of online schema change to geo_tags; all wikis on dblist have been updated [10:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:58] (03PS3) 10Filippo Giunchedi: cassandra: add restbase-test2002 instances [puppet] - 10https://gerrit.wikimedia.org/r/248307 [10:24:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2002 instances [puppet] - 10https://gerrit.wikimedia.org/r/248307 (owner: 10Filippo Giunchedi) [10:27:41] !log retrying schema change on EventLogging (db1046) after failure [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:26] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [10:45:54] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:48:14] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: puppet fail [10:49:09] mobrovac: ping me when you are back btw [11:08:17] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [11:11:02] https://logstash.wikimedia.org/#dashboard/temp/AVCUZF88ptxhN1Xa7psP [11:14:24] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:14:29] !log After days of no references on the logs to the renamed user_daily_contribs, deleting delete_user_daily_contribs table on all wikis [11:15:44] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:19:43] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1747776 (10fgiunchedi) restbase-test2002 is being converted to multi instance, currently bootstrapping `b` instance: ``` $ nodetool status -r Dat... [11:21:36] godog: back [11:22:54] mobrovac: hi! so basically https://gerrit.wikimedia.org/r/#/c/248313/ [11:23:16] * mobrovac looking [11:24:42] godog: i must be reading this wrong, does this blacklist *all* of these for *all* of the CFs? [11:24:58] mobrovac: it does [11:25:09] do we want that? [11:25:21] 5.6/10 is working nicely for massive table drops [11:26:58] mobrovac: I don't think we're looking at any of those, I didn't realize the mobileapps would create so many CFs or we'd have discussed it of course [11:27:54] on the other side- dropping 1000 tables from production shouldn't be so easy [11:28:07] mobrovac: IOW graphite jumped from 26k/s to 35k/s metrics with the new CFs https://grafana.wikimedia.org/dashboard/db/graphite-eqiad [11:28:30] godog: right, sorry for neglecting to mention the creation of 4 more CFs per KS [11:29:39] godog, would it make sense to divide eqiad and codfw, or would we prefer somethimg more transparent? [11:30:04] mobrovac: np, lesson learned :) [11:30:47] jynus: it might make sense if we want to keep things isolated, though the underlying problem remain (i.e. a single machine) [11:30:50] (in my mind that was one thing) [11:32:48] godog, with that you mean High availability, not load problemns, right? 
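The change under discussion here (gerrit 248313) blacklists Cassandra per-column-family metric suffixes that nothing alerts or dashboards on, to ease the load on graphite. As a rough illustration only, and not the actual cassandra-metrics-collector configuration, the idea amounts to filtering metric names against a few compiled patterns before they are emitted. The suffixes below mirror the ones later deleted from graphite in this log (15MinuteRate, 98percentile, 999percentile, meanRate, stddev), while 99percentile, which Icinga still uses, stays allowed; the sample metric names are made up.

# Illustrative sketch (not the real collector config) of suffix-based metric
# blacklisting: drop series whose names end in suffixes nobody consumes.
import re

BLACKLIST = [
    re.compile(r'\.15MinuteRate$'),
    re.compile(r'\.(98|999)percentile$'),
    re.compile(r'\.meanRate$'),
    re.compile(r'\.stddev$'),
]

def allowed(metric_name):
    """Return True if the metric should still be emitted to graphite."""
    return not any(p.search(metric_name) for p in BLACKLIST)

# Hypothetical metric names, for illustration only.
sample = [
    'cassandra.restbase1001-a.columnfamily.read_latency.99percentile',
    'cassandra.restbase1001-a.columnfamily.read_latency.999percentile',
    'cassandra.restbase1001-a.columnfamily.read_latency.15MinuteRate',
]

for name in sample:
    print(name, '->', 'keep' if allowed(name) else 'drop')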
[11:33:28] mobrovac: anyways, the same blacklisting is already enabled on the test cluster and seems to be working [11:33:55] jynus: correct [11:33:57] godog: yeah, i've just done a round of our metrics and indeed didn't find any using these [11:35:13] mobrovac: ditto for alarms in icinga, there's a couple using 99percentile which isn't going to be blacklisted [11:35:24] (03CR) 10Mobrovac: [C: 031] "Let's go ahead to mitigate the impact. I've done a round of our metrics and we don't seem to be using these anyways." [puppet] - 10https://gerrit.wikimedia.org/r/248313 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [11:36:27] (03PS2) 10Filippo Giunchedi: cassandra: enable metric blacklist for restbase [puppet] - 10https://gerrit.wikimedia.org/r/248313 (https://phabricator.wikimedia.org/T113733) [11:36:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: enable metric blacklist for restbase [puppet] - 10https://gerrit.wikimedia.org/r/248313 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [11:36:40] mobrovac: cool! [11:37:54] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [11:42:16] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1747804 (10jcrespo) 5Open>3Resolved This table has been dropped from all dblist wikis. ``` mysql -A -h s1-master.eqiad.wmnet enwiki -e "SHOW TABLES like '%user_daily_contribs%'" [... [11:47:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:51:02] (03PS1) 10Filippo Giunchedi: graphite: enable labs instances archiver [puppet] - 10https://gerrit.wikimedia.org/r/248317 (https://phabricator.wikimedia.org/T111540) [11:51:34] !log roll-restart cassandra-metrics-collector on restbase cluster after https://gerrit.wikimedia.org/r/248313 [11:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:52:45] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures [12:00:30] (03PS1) 10Faidon Liambotis: Fix getJobQueue cronspam [puppet] - 10https://gerrit.wikimedia.org/r/248318 [12:00:53] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Fix getJobQueue cronspam [puppet] - 10https://gerrit.wikimedia.org/r/248318 (owner: 10Faidon Liambotis) [12:07:27] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [12:10:03] 6operations: Offer a solution to manage @toolserver.org mail redirections - https://phabricator.wikimedia.org/T116373#1747813 (10Dereckson) 3NEW [12:16:37] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:18:36] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:19:54] (03PS1) 10Mobrovac: RESTBase: Set the correct base path for the global domain [puppet] - 10https://gerrit.wikimedia.org/r/248319 [12:38:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [12:42:10] (03PS1) 10Muehlenhoff: Assign salt grains for labmon [puppet] - 10https://gerrit.wikimedia.org/r/248322 [12:42:10] (03PS1) 10Muehlenhoff: Assign salt grains for nova controller [puppet] - 10https://gerrit.wikimedia.org/r/248323 [12:42:10] (03PS1) 10Muehlenhoff: Assign salt grains for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/248324 [12:43:14] oh, gerrit-wm choked on my serious of 17 patches [12:43:29] series [12:48:55] 
PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: puppet fail [12:51:50] (03PS2) 10Mobrovac: RESTBase: Set the correct base path for the global domain [puppet] - 10https://gerrit.wikimedia.org/r/248319 [13:10:33] (03PS2) 10Muehlenhoff: Assign salt grains for labmon [puppet] - 10https://gerrit.wikimedia.org/r/248322 [13:13:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labmon [puppet] - 10https://gerrit.wikimedia.org/r/248322 (owner: 10Muehlenhoff) [13:15:45] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [13:16:54] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:20:14] (03PS2) 10Muehlenhoff: Assign salt grains for nova controller [puppet] - 10https://gerrit.wikimedia.org/r/248323 [13:20:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for nova controller [puppet] - 10https://gerrit.wikimedia.org/r/248323 (owner: 10Muehlenhoff) [13:21:34] (03PS2) 10Muehlenhoff: Assign salt grains for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/248324 [13:22:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/248324 (owner: 10Muehlenhoff) [13:30:10] (03PS1) 10Jcrespo: Preparing db2055-db2070 for jessie installation [puppet] - 10https://gerrit.wikimedia.org/r/248341 (https://phabricator.wikimedia.org/T84428) [13:30:10] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1747924 (10Ottomata) I'd like an actual timestamp to be part of the framing for all events too. I'm all for a reqid, (although I'd bikeshed about th... [13:31:51] (03PS2) 10Muehlenhoff: Assign salt grains for installserver [puppet] - 10https://gerrit.wikimedia.org/r/248325 [13:33:54] !log remove 15MinuteRate cassandra metrics after https://gerrit.wikimedia.org/r/#/c/248313/ [13:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for installserver [puppet] - 10https://gerrit.wikimedia.org/r/248325 (owner: 10Muehlenhoff) [13:36:38] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1747925 (10Jgreen) >>! In T97676#1739669, @awight wrote: > Confirmed that the campaign is intact. All the pipeli... [13:37:46] mobrovac: btw mobileapps looks like 5 new CF per KS, not 2.. 
local_group_wikimedia_T_mobileapps_mobil0pubCG21 local_group_wikimedia_T_mobileapps_mobil5e520k6J local_group_wikimedia_T_mobileapps_mobilIwqYh6Fi local_group_wikimedia_T_mobileapps_mobilnBU1P4_a local_group_wikimedia_T_mobileapps_mobilvf9Bf4_E [13:37:50] (03CR) 10Ottomata: Initial debian packaging (036 comments) [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [13:37:55] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [13:38:06] (03PS3) 10Ottomata: Initial debian packaging [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 (https://phabricator.wikimedia.org/T116084) [13:38:15] (03PS2) 10Muehlenhoff: Assign salt grains for xenon [puppet] - 10https://gerrit.wikimedia.org/r/248326 [13:38:19] godog: ah yes, 5, not 4 as i stated before (forgot the one for mobile-text) [13:38:41] moritzm: should this package be called 'burrow' and not 'golang-burrow' since it is a daemon and not a library? [13:39:11] ottomata: yeah, I'm leaning towards burrow [13:39:28] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Set the correct base path for the global domain [puppet] - 10https://gerrit.wikimedia.org/r/248319 (owner: 10Mobrovac) [13:39:34] (03PS3) 10Alexandros Kosiaris: RESTBase: Set the correct base path for the global domain [puppet] - 10https://gerrit.wikimedia.org/r/248319 (owner: 10Mobrovac) [13:39:41] (03CR) 10Alexandros Kosiaris: [V: 032] RESTBase: Set the correct base path for the global domain [puppet] - 10https://gerrit.wikimedia.org/r/248319 (owner: 10Mobrovac) [13:39:41] k, will do [13:39:43] most packages in the archive seem to use golang-foo for libs only [13:40:01] aye, just realized [13:40:06] oof, have to recreate the gerrit repo it hink [13:40:11] ah well :) [13:40:17] (03PS2) 10Rush: admin: allow all active users to be applied [puppet] - 10https://gerrit.wikimedia.org/r/244471 [13:41:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for xenon [puppet] - 10https://gerrit.wikimedia.org/r/248326 (owner: 10Muehlenhoff) [13:41:15] (03PS3) 10Muehlenhoff: Assign salt grains for xenon [puppet] - 10https://gerrit.wikimedia.org/r/248326 [13:41:30] thnx akosiaris! [13:41:51] (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for xenon [puppet] - 10https://gerrit.wikimedia.org/r/248326 (owner: 10Muehlenhoff) [13:43:14] are unmerged changes to mira (mediawiki-config) supposedly to stay like that? [13:43:56] or is there a script/sync problem there? 
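The question just above about unmerged changes on mira refers to the Icinga check that, later in this log, reports "There are 9 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/)". A rough sketch of how such a check can be expressed with plain git follows; it assumes the staging clone is meant to track origin/master and is an illustration only, not the plugin actually used in production.

# Rough sketch of an "unmerged changes" style check for a deployment staging
# clone such as /srv/mediawiki-staging (the directory named in the alert later
# in this log). Assumes the working copy should match origin/master; this is
# an illustration, not the actual Icinga plugin.
import subprocess
import sys

REPO = '/srv/mediawiki-staging'
UPSTREAM = 'origin/master'   # assumption: branch the staging copy should match

def unmerged_count(repo, upstream):
    # Count commits present on the upstream branch but not yet merged locally.
    subprocess.check_call(['git', '-C', repo, 'fetch', '--quiet', 'origin'])
    out = subprocess.check_output(
        ['git', '-C', repo, 'rev-list', '--count', 'HEAD..' + upstream])
    return int(out.decode().strip())

if __name__ == '__main__':
    n = unmerged_count(REPO, UPSTREAM)
    if n == 0:
        print('OK: no unmerged changes in %s' % REPO)
        sys.exit(0)
    print('CRITICAL: there are %d unmerged changes in %s' % (n, REPO))
    sys.exit(2)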
[13:44:19] (03PS1) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248342 (https://phabricator.wikimedia.org/T116084) [13:44:32] (03PS1) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248343 (https://phabricator.wikimedia.org/T116084) [13:45:46] !log remove 98percentile 999percentile meanRate stddev cassandra metrics after https://gerrit.wikimedia.org/r/#/c/248313/ [13:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:34] (03PS2) 10Jcrespo: Preparing db2055-db2070 for jessie installation [puppet] - 10https://gerrit.wikimedia.org/r/248341 (https://phabricator.wikimedia.org/T84428) [13:47:16] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:21] (03CR) 10Jcrespo: [C: 032] Preparing db2055-db2070 for jessie installation [puppet] - 10https://gerrit.wikimedia.org/r/248341 (https://phabricator.wikimedia.org/T84428) (owner: 10Jcrespo) [13:48:10] (03PS2) 10Muehlenhoff: Assign salt grains for maps [puppet] - 10https://gerrit.wikimedia.org/r/248327 [13:48:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for maps [puppet] - 10https://gerrit.wikimedia.org/r/248327 (owner: 10Muehlenhoff) [13:49:19] (03Abandoned) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248343 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [13:50:00] (03PS1) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248344 (https://phabricator.wikimedia.org/T116084) [13:50:36] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1747951 (10mobrovac) >>! In T116247#1747924, @Ottomata wrote: > I'd like an actual timestamp to be part of the framing for all events too. I'm all f... [13:51:11] (03PS2) 10Muehlenhoff: Assign salt grains for Icinga [puppet] - 10https://gerrit.wikimedia.org/r/248328 [13:52:51] moritzm: this one should be good [13:52:51] https://gerrit.wikimedia.org/r/#/c/248344/ [13:55:41] !log Rebooting and installing jessie on db2055-db2059 [13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:52] \o/ [13:57:01] ^carbon got updated by puppet, install2001 did not, not sure if that will work, but I will know instantly :-) [13:58:19] ottomata: k, I'll review that later the evening, need to run some errands soon [14:00:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for Icinga [puppet] - 10https://gerrit.wikimedia.org/r/248328 (owner: 10Muehlenhoff) [14:03:01] it did work, I suppose it is a dns-only change [14:04:15] (03PS2) 10Muehlenhoff: Assign salt grains for kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/248329 [14:05:08] to switch to jessie? 
dhcp would need to reload, though afaik it runs on carbon only [14:06:02] yes, I was wrong about that, expecting changes on the install host [14:06:58] puppet reloads dhcp automatically [14:07:20] one day I will study all hosts and submodules we have, the 400 we have [14:07:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [14:12:40] ACKNOWLEDGEMENT - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail daniel_zahn new install [14:12:41] !log restbase rolling-restarting after applying aab840f [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:06] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:16:36] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: puppet fail [14:17:36] jynus: did you get a reply re. mira? [14:17:47] I think I didn't [14:18:13] jynus: well https://gerrit.wikimedia.org/r/#/c/224313/ in sort [14:18:49] until that is merged and scap syncs both masters; only tin is used and mira isn't maintained well (just like the apparent third puppet master :) ) [14:19:29] ok, then I will ack the issue pointing to the bug report [14:19:38] assuming mira is not used currently [14:20:50] what was the question about mira? [14:20:50] it shouldn't be used for deployments at least [14:21:01] It can't properly be used for deployments yet, I've tested it [14:21:15] Krenair: unmerged changes to /srv/mediawiki-staging alerts [14:21:18] it is ok, I was just asking because there was an alert [14:21:31] https://gerrit.wikimedia.org/r/#/c/247965/ is needed before deployments from mira can work at all [14:21:51] https://gerrit.wikimedia.org/r/#/c/224313/ should be done before people start actively using it [14:21:59] yes, that alert is normal [14:22:01] if people are aware of it, there is no need to stress [14:22:09] :-) [14:22:15] It's because I recently brought it up to date to test, and then it got out of date again [14:24:46] ACKNOWLEDGEMENT - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 9 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). Jcrespo https://phabricator.wikimedia.org/T104826 [14:26:57] thanks jynus [14:27:13] !log remove restbase-test2001 restbase-test2002 cassandra metrics [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:41] (03PS3) 10Luke081515: Rename two namespaces at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) [14:37:26] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [14:39:15] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 36657 MB (3% inode=97%) [14:39:35] disk space on graphite? [14:39:37] Hrm [14:39:50] Oh, 2001. [14:44:09] an odyssey alright [14:44:16] (looking) [14:47:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:52] !log remove outdated cassandra metrics from graphite2001 [14:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:24] RECOVERY - Disk space on graphite2001 is OK: DISK OK [14:56:35] hmmm godog, yt? [14:56:41] ottomata: sure [14:56:53] sooo, statsd pipeline, cool! 
[14:56:56] but, i'm trying it now [14:57:03] and it doesn't seem to report the metrics as consistently [14:57:11] see https://graphite.wikimedia.org/ [14:57:25] 6operations, 6Analytics-Engineering: Install docker on analytics1004.eqiad.wmne - https://phabricator.wikimedia.org/T116383#1748053 (10Dzahn) [14:57:29] test.reqstats.misc.client [14:57:34] that's the prefix i'm sending right now [14:57:35] from cp1057 [14:57:47] but running at interval of 2 seconds for the last 10 minutes [14:57:56] client.get only has upper? [14:58:03] there's no client.total? [14:59:54] ottomata: I'm seeing client.total and client.method.get (both with .upper) [15:00:28] ja but why only upper? [15:00:36] 6operations, 6Analytics-Engineering: Install docker on analytics1004.eqiad.wmne - https://phabricator.wikimedia.org/T116383#1748059 (10Krenair) [15:00:38] head only has mean? [15:00:52] there's no stats.2xx [15:00:55] status.2xx* [15:01:05] ottomata: possibly because graphite is busy creating a ton of metrics for cassandra, so it might take a while to converge [15:01:08] oh hm. [15:01:09] oh ok [15:01:33] ottomata: see also "creates" under "misc" https://grafana.wikimedia.org/dashboard/db/graphite-eqiad [15:01:58] (03PS5) 10Luke081515: Enable four new namespaces at thwikitionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) [15:03:01] (03CR) 10Ori.livneh: "Oops. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/248318 (owner: 10Faidon Liambotis) [15:04:08] 6operations, 6Analytics-Engineering: Install docker on analytics1004.eqiad.wmne - https://phabricator.wikimedia.org/T116383#1748064 (10JAllemandou) Hi Guys, I was thinking to use a spare to test loading a 10G dataset into both Druid and ES, using docker to set up the services. Given the reasonnable size of the... [15:04:43] godog: ah I see, it is very up! [15:06:11] !log running nodetool cleanup on restbase-test2003 [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:55] ottomata: yeah that's likely why [15:07:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [15:12:58] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748095 (10Ottomata) > So the producer would store the same time stamp twice? UUID v1 already contains it. Could you provide an example of what this... [15:14:09] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748102 (10Ottomata) > topics named something like mw-edit and mw-edit-private perhaps (where the latter contains this extra info). I'd prefer if we... 
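The reqstats change being tested above drops the broken diamond collector and emits to statsd directly over UDP. A hedged sketch of that approach follows, using the generic statsd line protocol rather than the exact code under review; host, port and metric names are illustrative. It also shows why only some sub-series had appeared yet: a timer produces several derived series on the graphite side (.count, .mean, .upper and so on), and each derived series is a separate whisper file that has to be created before it shows up.

# Hedged sketch of emitting counters and timers to statsd over plain UDP.
# This is the generic statsd line protocol, not the exact patch under review;
# host, port, prefix and metric names are assumptions for illustration.
import socket

STATSD = ('statsd.eqiad.wmnet', 8125)   # assumption: standard statsd UDP port
PREFIX = 'test.reqstats.misc'           # prefix mentioned in the log

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, value=1):
    # Counter: "name:value|c"
    sock.sendto(('%s.%s:%d|c' % (PREFIX, name, value)).encode(), STATSD)

def timing(name, ms):
    # Timer: "name:value|ms"; statsd derives mean/upper/count series from these.
    sock.sendto(('%s.%s:%d|ms' % (PREFIX, name, ms)).encode(), STATSD)

# Example: one GET request with a 2xx response that took 123 ms.
incr('client.total')
incr('client.method.get')
incr('client.status.2xx')
timing('client.backend_time', 123)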
[15:14:55] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:17:16] (03PS2) 10Ottomata: The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) [15:17:34] (03CR) 10Ottomata: The varnish reqstats diamond collector does not work, emit to statsd directly instead (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:17:49] (03CR) 10jenkins-bot: [V: 04-1] The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:18:33] (03PS3) 10Ottomata: The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) [15:29:16] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1748131 (10Ottomata) @ori, @mark, @gwicke, @kevinator I'm not really sure how to move this... [15:29:25] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1748133 (10Ottomata) a:3Ottomata [15:33:07] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1748147 (10fgiunchedi) @papaul, there will be some bios settings to change too, likely: (related {T112627}) * bios mode set to UEFI * power settings/profile ("power" ?) * all disks pre... [15:33:52] (03CR) 10Ottomata: [C: 031] Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974 (owner: 10Muehlenhoff) [15:34:16] (03CR) 10Ottomata: [C: 031] Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 (owner: 10Muehlenhoff) [15:39:04] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [15:40:09] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1748173 (10fgiunchedi) @cmjohnson there will be some bios settings to change too, likely (related {T112627}) * bios mode set to UEFI * power settings/profile ("power" ?) * all disks presented individ... [15:42:05] !log bouncing restbase-test2002-a Just To See [15:42:09] !log copied wmf-mariadb10 (10.0.16-2) .deb from trusty to jessie on apt.wikimedia.org [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:30] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1748180 (10Cmjohnson) yep, i got that! [15:44:08] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1748185 (10fgiunchedi) @cmjohnson, allocation plan looks like one each in A, B and D. 
If 10G ports are available let's go with that, otherwise 1G [15:46:16] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1748188 (10Cmjohnson) If you want a 10G port...then they need to go to specific racks but 10G in row A is very limited, non-existant in row B and pretty open in row D. [15:46:28] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:49:07] for some reason, reprepro copy doesn't work for me, I have to include the .deb from the pool :-/ [15:51:06] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1748238 (10fgiunchedi) ok, let's go for 1G where port availability is limited/none and 10G in row D [15:56:34] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748244 (10JohnLewis) @selsharbaty-wmf This sort of went under radar lately with me so: Rename education-collab to education-collab-private; Create education-collab. Edu... [16:00:37] (03PS1) 10Alexandros Kosiaris: aqs: Fix the LVS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/248354 [16:01:42] (03CR) 10Alexandros Kosiaris: [C: 032] aqs: Fix the LVS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/248354 (owner: 10Alexandros Kosiaris) [16:03:04] (03PS1) 10Jcrespo: Adding new installed servers to the mariadb::core role [puppet] - 10https://gerrit.wikimedia.org/r/248355 (https://phabricator.wikimedia.org/T84428) [16:06:00] (03CR) 10Ottomata: "Since you haven't modified the email template nor the logging.cfg file from the package's versions, you don't need to manage these with pu" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [16:06:20] ^wow, my last patch should not have been verified [16:06:23] (03CR) 10Ottomata: "You can remove the burrow.systemd file too." [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [16:08:46] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [16:09:03] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1748274 (10GWicke) @ottomata, we have some hardware budget left in services that we could p... 
[16:09:21] (03PS2) 10Jcrespo: Adding new installed servers to the mariadb::core role [puppet] - 10https://gerrit.wikimedia.org/r/248355 (https://phabricator.wikimedia.org/T84428) [16:12:15] 6operations, 6Analytics-Kanban, 10Traffic, 5Patch-For-Review: Flag in x-analytics in varnish any request that comes with no cookies whatsoever {bear} [5 pts] - https://phabricator.wikimedia.org/T114370#1748283 (10Nuria) 5Open>3Resolved [16:14:44] !log starting rebuild of restbase-test2002-b [16:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:44] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.004 second response time [16:16:06] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:18:33] looking [16:18:44] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:12] godog: graphite.wikimedia.org is a 502 bad gateway [16:19:34] ori: indeed [16:19:40] oh oops [16:19:43] missed your message [16:20:15] !log bounce uwsgi for graphite-web on graphite1001 [16:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:34] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [16:20:50] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748292 (10AKoval_WMF) Thanks for remembering this and us, @JohnLewis! I was just about to poke. ;) I do hope that we can resolve this matter soon. It is very frustratin... [16:20:52] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1748293 (10ori) We have hardware budget for this in performance, too. [16:22:08] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748300 (10JohnLewis) Archives have stopped? Interesting, I'll look into that now. Once I get confirmation, I can do the rename today hopefully. [16:22:10] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687618 (10Ottomata) Yea as long as one node can handle all the production traffic, the 2 i... [16:22:54] chasemp, Coren, should we skip our call today? I feel like the only action item is ‘continue setting up test cluster' [16:23:09] (which is probably all me and papaul at this point) [16:23:38] (03PS1) 10Bartosz Dziewoński: Revert "Hardcode UploadWizard max upload size" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248360 (https://phabricator.wikimedia.org/T98933) [16:24:20] (03CR) 10Bartosz Dziewoński: [C: 04-1] "(don't deploy this now)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248360 (https://phabricator.wikimedia.org/T98933) (owner: 10Bartosz Dziewoński) [16:26:03] godog: did we start pushing lots of new metrics to graphite? 
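With graphite1001 now returning 502s under the flood of new series, the two remedies visible in this log are blacklisting new metric names and deleting series that are no longer wanted. Because every graphite series is a whisper file under /var/lib/carbon, the second remedy boils down to removing the matching .wsp files, which is what the earlier "remove ... cassandra metrics" entries did. A dry-run-first sketch follows; the root path, pattern and flag names are illustrative assumptions, not the exact commands that were run.

# Hedged sketch of pruning obsolete whisper files, the disk-level counterpart
# of the "remove ... cassandra metrics" log entries. Paths, pattern and the
# dry-run default are assumptions for illustration only.
import argparse
import pathlib

def prune(root, pattern, apply_changes=False):
    removed = 0
    for path in pathlib.Path(root).rglob(pattern):
        print(('removing ' if apply_changes else 'would remove ') + str(path))
        if apply_changes:
            path.unlink()
        removed += 1
    return removed

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='prune obsolete whisper files')
    parser.add_argument('--root', default='/var/lib/carbon/whisper/cassandra')
    parser.add_argument('--pattern', default='*15MinuteRate.wsp')
    parser.add_argument('--apply', action='store_true',
                        help='actually delete; default is a dry run')
    args = parser.parse_args()
    n = prune(args.root, args.pattern, args.apply)
    print('%d file(s) matched' % n)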
[16:26:13] !log uwsgi timing out while serving requests, bounce also carbon daemons [16:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:19] ori: we did this morning yeah [16:28:10] what do you think about https://gerrit.wikimedia.org/r/#/c/248355/ should I wait until monday? [16:28:47] chasemp: Coren (also I want to attend the tech talk that is at the same time) [16:28:53] jynus: do ittttt :P [16:29:10] ori, I do not take advice from you about deployment windows [16:29:14] :-D [16:29:20] wise [16:29:23] :P [16:29:37] * greg-g goes into a call [16:30:11] jynus: what's the risk? [16:30:11] I will be around for a while, worst case scenario, puppet breaks and I revert [16:30:19] yeah [16:30:21] I think none [16:30:24] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 1.555 second response time [16:31:00] and that is the reason why I do not let puppet populate my mysqls and do anything clever [16:31:09] * greg-g nods [16:31:27] looks like we're back, but I'm not 100% sure what that was yet [16:32:13] (in fact, puppet is disabled on the new nodes, so it should be a null edit) [16:32:31] we really need a forward-looking strategy for metrics storage [16:32:34] jynus: you have convinced myself, thank you [16:33:04] ori, we talked about that on the offsite [16:33:42] there is a short-term strategy- but we didn't find a definitive answer long term [16:35:12] andrewbogott: didn't know we had a call today [16:35:14] (03PS3) 10Jcrespo: Adding new installed servers to the mariadb::core role [puppet] - 10https://gerrit.wikimedia.org/r/248355 (https://phabricator.wikimedia.org/T84428) [16:35:43] papaul: you’re not in that call usually, it’s just that everything that /that/ call is about is blocked by other stuff [16:35:54] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time [16:35:55] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [16:36:28] andrewbogott: ok [16:36:35] (03CR) 10Jcrespo: [C: 032] Adding new installed servers to the mariadb::core role [puppet] - 10https://gerrit.wikimedia.org/r/248355 (https://phabricator.wikimedia.org/T84428) (owner: 10Jcrespo) [16:37:04] looking, heh, it feels like a heavy query/dashboard [16:37:44] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [16:38:04] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [16:38:22] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748329 (10JanZerebecki) If we offer public access to the public events of the past we need to rewrite them according to new events that hide previou... [16:40:17] andrewbogott: yeah fine w/ me, could you outline a bit (even after the tech talk) how I can help w/ the test cluster? [16:40:29] sure [16:41:36] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: puppet fail [16:45:04] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.013 second response time [16:46:02] puppet config is wong - adding wikimedia repositories is not a pre of installing mariadb :-( [16:47:13] jynus: puppet is never wrong! 
it is only outdated, poorly design or incomplete :) [16:47:14] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:23] !log stop puppet on graphite1001 and graphite-index cron, suspected root cause [16:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [500.0] [16:49:18] ori: main camps seem to be whisper files on disk (i.e. carbonate) vs completely different backend (i.e. cyanite or newts) [16:49:42] is influxdb not under consideration? [16:50:33] (03PS1) 10Cmjohnson: Adding mgmt dns for ms-be1019-1021 [dns] - 10https://gerrit.wikimedia.org/r/248367 [16:50:44] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:04] andrewbogott: No worries, we don't have anything new to discuss atm I think beyond "Mark approved the hardware" [16:51:13] 6operations: Christian Aistleitner - qchris shell access still active - https://phabricator.wikimedia.org/T116391#1748363 (10RobH) 3NEW a:3Tnegrin [16:51:28] ori: heh the 500s are back again, looking [16:51:55] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for ms-be1019-1021 [dns] - 10https://gerrit.wikimedia.org/r/248367 (owner: 10Cmjohnson) [16:52:01] 6operations: audit contractors sheet against cluster access - https://phabricator.wikimedia.org/T114430#1748386 (10RobH) p:5Triage>3High [16:52:32] 6operations: audit contractors sheet against cluster access - https://phabricator.wikimedia.org/T114430#1695180 (10RobH) [16:52:35] 6operations, 10Analytics: removed user handrade from access - https://phabricator.wikimedia.org/T114427#1748392 (10RobH) 5Open>3Resolved [16:55:00] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248344 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [16:56:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:56:44] 6operations: audit contractors sheet against cluster access - https://phabricator.wikimedia.org/T114430#1748410 (10RobH) line 120 of 226 [16:57:39] 6operations: Christian Aistleitner - qchris shell access still active - https://phabricator.wikimedia.org/T116391#1748413 (10Krenair) [16:57:41] 6operations, 6Security, 5Patch-For-Review, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1748414 (10Krenair) [16:58:05] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 9.584 second response time [16:58:05] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.72% of data above the critical threshold [1000.0] [16:58:37] godog: i'd say revert, at this point [17:00:48] ori: might an option, not sure how viable the new metrics are coming from new cassandra column families cc mobrovac gwicke urandom [17:01:09] anyways it did work fine from this morning (utc), I'm trying to pinpoint if it is a particular query [17:04:18] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2055 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:04:24] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:04:31] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2055 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:04:38] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2060 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:04:45] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2060 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:04:53] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2060 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:05:00] ACKNOWLEDGEMENT - mysqld processes on db2060 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:05:06] huh [17:05:08] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2061 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:05:15] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2061 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:05:16] all those acks are paging ;] [17:05:23] ACKNOWLEDGEMENT - mysqld processes on db2061 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:05:25] ops [17:05:31] ACKNOWLEDGEMENT - mysqld processes on db2062 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:05:35] jynus: ouch [17:05:38] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2064 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:05:40] <_joe_> de-check "send notification" [17:05:42] meh, no one gets paged when they are asleep (cept brandon and ariel who are both awake techncially) [17:05:46] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2064 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:05:48] jynus: all codfw I 
think? [17:05:53] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2064 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:06:00] also they are new installs [17:06:01] ACKNOWLEDGEMENT - mysqld processes on db2064 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:06:02] so, false alarm [17:06:08] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2065 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:06:11] <_joe_> akosiaris: yup [17:06:15] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2065 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:06:18] right now im really glad we fixed paging to all come from a single thread [17:06:20] =] [17:06:22] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2067 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:06:28] single thread?! [17:06:30] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2067 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:06:32] <_joe_> were you waiting for a friday late party, weren't you? [17:06:32] single sender ID [17:06:37] every single one of thse is a new text message from a new sender [17:06:38] for me [17:06:38] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2067 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:06:45] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2068 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:06:48] ottomata: ok... lets take a look at your provider settings, what carrier? [17:06:49] robh: spoke too soon [17:06:52] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2068 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:06:53] tmobile [17:06:59] ACKNOWLEDGEMENT - MariaDB Slave SQL: s1 on db2068 is CRITICAL: CRITICAL slave_sql_state could not connect Jcrespo mysql is not running yet [17:07:07] ACKNOWLEDGEMENT - mysqld processes on db2068 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:07:08] ottomata: oh... i have the same and they all read from icinga... lemme check out the config differences between us [17:07:14] ACKNOWLEDGEMENT - mysqld processes on db2069 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo mysql is not running yet [17:07:21] ACKNOWLEDGEMENT - MariaDB Slave IO: s1 on db2070 is CRITICAL: CRITICAL slave_io_state could not connect Jcrespo mysql is not running yet [17:07:25] <_joe_> so jynus how many pages should I expect? :) [17:07:28] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag could not connect Jcrespo mysql is not running yet [17:07:30] same here... [17:07:35] around 200 or what ? [17:07:39] they all say icinga@neon.wikmiedia.org in the message [17:07:41] mine are all from "wikimedia" [17:07:44] but each one is a new number [17:07:44] thank goodness [17:07:47] wikimedia as well [17:07:47] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.011 second response time [17:07:50] 962-0 961-9 961-8 [17:07:51] etc. [17:08:07] ottomata: ohhh, that may be your phone being shitty [17:08:08] <_joe_> the graphite alarm is genuine, though [17:08:14] uhm.. [17:08:17] my phone sorts by the email sender not the gateway? 
[17:08:34] i got iphone 6 [17:08:45] ottomata: there's your problem :P [17:08:51] * akosiaris just joking [17:08:56] well, it seems iOS sorts in the sms app by the shortcode [17:09:03] and not by the passed along from field [17:09:07] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:09:13] this is just me assuming! [17:09:51] cuz we have the same sender settings in our config [17:09:56] and you get the from email address info. [17:10:04] its likely the SMS app. [17:10:07] PROBLEM - mysqld processes on db2065 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:10:14] PROBLEM - MariaDB Slave SQL: s1 on db2070 is CRITICAL: CRITICAL slave_sql_state could not connect [17:10:26] moar!!! [17:10:33] PROBLEM - MariaDB Slave IO: s1 on db2069 is CRITICAL: CRITICAL slave_io_state could not connect [17:10:46] ottomata: unfortunately, we cannot just shift you to the EU solution. The US cellular companies won't take sender ID passalong from the service we use for the EU paging [17:10:55] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:10:57] its why we use the US cellular provider gateways [17:11:03] moritzm: nothing to see here, no worries ;-) [17:11:14] PROBLEM - MariaDB Slave SQL: s1 on db2069 is CRITICAL: CRITICAL slave_sql_state could not connect [17:11:17] _joe_: indeed, I'm looking at it [17:11:21] PROBLEM - mysqld processes on db2070 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:11:28] this wouldn't happen if checks were disabled by default :-/ [17:11:28] PROBLEM - mysqld processes on db2067 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:11:32] hm, so no solution for poor little otto [17:11:35] PROBLEM - MariaDB Slave IO: s1 on db2062 is CRITICAL: CRITICAL slave_io_state could not connect [17:11:53] we can't really disable checks by default ... [17:12:02] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:12:06] I'll silence icinga for a few mins [17:12:22] PROBLEM - MariaDB Slave SQL: s1 on db2062 is CRITICAL: CRITICAL slave_sql_state could not connect [17:12:38] akosiaris: my ISP dropped me off for the last 10 mins, whatever there was, I didn't see it anyway :-) [17:13:08] !log enable_notifications=0 in neon's icinga for a few mins while the storm dies down [17:13:12] everything is ok now [17:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:16] grrr [17:13:27] moritzm: I was joking about the pages [17:13:40] ah :-) [17:18:21] (03PS3) 10Giuseppe Lavagetto: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 [17:18:22] the funny thing is I did not get a single sms [17:18:23] (03PS8) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [17:18:31] <_joe_> jynus: lol [17:18:31] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1748532 (10RobH) [17:18:33] wat ? [17:18:37] that sounds wrong [17:18:42] no, that is wrong...
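For reference, the global silencing logged above (enable_notifications=0) can also be toggled at runtime through Icinga's external command file instead of touching the config; a minimal sketch, using the stock Nagios/Icinga external-command syntax and the command-file path quoted a little further down:

    # on the icinga host (neon): globally mute notifications while the storm passes
    printf '[%lu] DISABLE_NOTIFICATIONS\n' "$(date +%s)" > /var/lib/nagios/rw/nagios.cmd
    # and turn them back on afterwards
    printf '[%lu] ENABLE_NOTIFICATIONS\n' "$(date +%s)" > /var/lib/nagios/rw/nagios.cmd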
[17:18:46] (03CR) 10jenkins-bot: [V: 04-1] Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto) [17:18:54] (03CR) 10jenkins-bot: [V: 04-1] Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto) [17:19:01] dang, i got 30 or spomething [17:19:27] my last SMS notification is FR network glitch [17:19:47] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1748543 (10GWicke) Based on the data we got in labs (1300 events / s on a single-core labs... [17:19:51] <_joe_> another sms provider, another stream of awesomeness [17:20:41] in the actual text it does not say ACK like on IRC, it says CRIT, but the comment "jcrespo mysql is not installed yet" still told me it was ok [17:21:34] my apologies, I always uncheck the send notification, but I panicked and did it fast to race the CRITs [17:21:38] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748552 (10Selsharbaty-WMF) Thanks for your dedication @JohnLewis. Please go ahead with your plan. We can never thank you enough! [17:22:01] no problem, the comment made the difference [17:22:32] I downtimed all the hosts 3 hours ago, but these where new checks because it is a reinstall [17:23:13] I am very sorry [17:23:27] yea, it's a common thing when new servers are installed. because they appear in icinga fully automatic after getting base puppet [17:23:33] dont worry [17:24:14] what about not disabling them by default, but starting them downtimed for X minutes, would do be better? [17:24:33] s/do/it/ [17:25:41] soo, i once talked about this with brandon and that was the idea, yea, schedule downtime for them before they are added. i made a script for scheduling downtimes [17:26:01] but the part that is lacking is some mechanism that does this automatically when new servers are installed [17:26:10] would schedule downtime for a service that doesn't exist work? [17:26:30] I actually used "downtime for this host and all its services" [17:26:45] i _think_ yes. it is just writing a matching service name into that command file .. 
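The downtime scheduling being discussed is just another external command written into that same file; a rough sketch of the host-wide and single-service forms (host and service names are taken from the alerts above purely as examples; the production helper script on neon is quoted a few lines further down):

    cmdfile=/var/lib/nagios/rw/nagios.cmd
    start=$(date +%s); end=$((start + 7200))   # e.g. a two-hour window
    # downtime the host itself, then all of its services
    printf '[%lu] SCHEDULE_HOST_DOWNTIME;db2055;%s;%s;1;0;7200;jynus;new install\n' "$start" "$start" "$end" > "$cmdfile"
    printf '[%lu] SCHEDULE_HOST_SVC_DOWNTIME;db2055;%s;%s;1;0;7200;jynus;new install\n' "$start" "$start" "$end" > "$cmdfile"
    # or just one named service
    printf '[%lu] SCHEDULE_SVC_DOWNTIME;db2055;mysqld processes;%s;%s;1;0;7200;jynus;new install\n' "$start" "$start" "$end" > "$cmdfile"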
[17:26:46] (03PS4) 10Giuseppe Lavagetto: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 [17:26:48] (03PS9) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [17:26:53] (03PS1) 10Pmlineditor: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) [17:27:11] (03CR) 10jenkins-bot: [V: 04-1] Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto) [17:27:18] (03CR) 10jenkins-bot: [V: 04-1] Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto) [17:27:31] jynus: on "neon" /usr/local/bin/icinga-downtime [17:27:55] there's the inevitable race with that thing [17:28:05] that you need to schedule the downtime before the first alert [17:28:18] but after neon gets the configuration [17:28:41] neon may not need the config [17:29:03] not sure I follow [17:29:13] the way to schedule a downtime is like this: [17:29:22] 13 commandfile="/var/lib/nagios/rw/nagios.cmd" [17:29:31] 47 printf "[%lu] SCHEDULE_HOST_DOWNTIME;${hostname};${start_time};${end_time};1;0;${duration};${user};${reason}\n" $(date +%s) > $commandfile [17:29:32] (03PS5) 10Giuseppe Lavagetto: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 [17:29:35] (03PS10) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [17:29:45] i think it may not care if the host is in the config files [17:29:52] it may be ok to just add it before [17:30:22] you mean if it will accept the command ? [17:30:28] it will, but then will do nothing with it [17:30:38] yea, it's just writing into that command file with print [17:30:45] and later it will try to match host names [17:30:48] no [17:30:56] it will just send it to /dev/null [17:31:16] if you try to do that from the interface you will get an unauthorized error [17:31:35] but from the command line, nothing happens [17:31:44] cause it has no way of returning anything back [17:31:47] ok,, hrm, then back to the race [17:32:26] there are ways out of the race [17:32:50] but it entails a human saying "We are OK, now you can pester us with notifications" [17:33:01] guess what will end up not happening [17:33:15] cause as always humans are the weakest link [17:36:09] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748641 (10JohnLewis) a:5Selsharbaty-WMF>3JohnLewis [17:36:48] (03PS2) 10Luke081515: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor) [17:37:01] (03PS1) 10MaxSem: Beta: add cache headers to WP portal [puppet] - 10https://gerrit.wikimedia.org/r/248374 [17:39:47] akosiaris: puppet code in base that sets default contact group for icinga notiications, and an "if" around that that looks at the node's uptime.. 
only after i'm up for at least X, add the "sms" contact group ?:p [17:39:55] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748667 (10GWicke) @ottomata, UUIDs are described in https://en.wikipedia.org/wiki/Universally_unique_identifier. An example for a v1 UUID is `b54adc... [17:42:32] yeah we had that before, but we ended up removing it again for some reason [17:42:46] (we had something in the puppet->icinga stuff that didn't add if it was too new in puppetdb?) [17:43:18] <_joe_> bblack: yeah, you wrote it in naggen2, then we had some problems with that, and we reverted it [17:43:32] <_joe_> don't ask me the details of "problems" [17:43:42] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:42] on the bright side- I only paged you once (in period of times), instead of n times, 1 per server (these are 16 new servers) [17:43:42] another option would be some way of having naggen2 notice whether there's ever been a non-failing full puppet run logged yet [17:43:52] <_joe_> but it was like "people freaking out because their alarms weren't showing up" [17:43:59] ebernhardson: how's the copy / nobelium stuff going? [17:47:14] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748732 (10GWicke) @JanZerebecki: Suppression information would indeed be needed for public access to older events. One option would be to key this o... [17:49:06] 6operations, 7Database, 7Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1748737 (10jcrespo) 3NEW a:3jcrespo [17:49:11] YuviPanda: copy to codfw is working its way around, i re-pinged you on the ticket to move the disks on nobelium to get that moving though. its still writing to the 84G disk instead of the 5T one [17:49:24] oh [17:49:25] damn [17:49:29] let me find iit [17:49:48] https://phabricator.wikimedia.org/T114856 [17:49:48] YuviPanda: i could probably just manually unmount/remount the disk, or edit the elasticsearch config to work around the issue [17:50:37] ebernhardson: I wonder if we should just edit the elasticsearch config. sounds the easiest thing to do... [17:50:40] YuviPanda: second thing, the copy ended up taking much more cpu than i expected (on the machine running php) so i can only run one copy at a time (either ->codfw, or ->labsearch). If we could pull the mediawiki code into nobelium i could use that though [17:50:42] ebernhardson: is it a hiera variable? [17:51:05] YuviPanda: well, i meant using `sudo vi ...` but that would probably be better :) [17:51:16] lemme look [17:51:17] tch tch sudo vi :P [17:51:38] * ebernhardson isn't so picky about hacking up test machines :P [17:51:42] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748754 (10Dzahn) I'm here as a backup if help is needed since we are now using a script i made to replace the manual instructions that were kind of long and error prone.... [17:52:49] YuviPanda: looks like we dont set it for normal uses, its the `path.data` line in elasticsearch.yml [17:53:04] ebernhardson: hmm, ok let me do a remount then... [17:53:19] ebernhardson: and as for mediawiki on nobelium - which role should we apply? 
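On the data-path question above, the two options are pointing path.data in elasticsearch.yml somewhere else, or mounting the big array where Elasticsearch already looks; a sketch of the latter, which is what gets done a few minutes further down (device and mount point are the ones named in this log, the rest is generic):

    sudo service elasticsearch stop
    # put the 5T md array where the default data path lives (any existing
    # contents of /var/lib/elasticsearch are shadowed by the new mount)
    sudo mount /dev/md1 /var/lib/elasticsearch
    df -h /var/lib/elasticsearch        # confirm the big filesystem is there
    sudo service elasticsearch start
    # elasticsearch logs its data paths at startup, or ask it directly:
    curl -s localhost:9200/_nodes/stats/fs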
[17:53:34] 6operations, 10Wikidata, 7Database, 7Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1748759 (10hoo) [17:53:37] not sure either :S looking [17:53:54] !log stopping elasticsearch on nobelium for https://phabricator.wikimedia.org/T114856 [17:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:06] ebernhardson: how will the jobrunners feel when I shut es down? [17:54:50] YuviPanda: they will re-queue the jobs (only testwiki is currently on) with a backoff, and drop them after 10 minutes [17:55:04] kkk [17:57:03] YuviPanda: would just applying role::mediawiki::common be enough to set it up? i'm worried some of these other ones might put nobelium into actual prod rotation and not just install mediawiki [17:57:10] (well, actually install scap so we can sync-common) [17:57:20] !log delete wikimedia-grid grafana dashboard (saved a copy first) heavy graphite queries [17:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:48] YuviPanda: otherwise, i suppose whatever role makes terbium have scap would work [17:58:25] YuviPanda: which is probably the role::mediawiki::common and then role::mediawiki::maintenance [17:58:30] !log moved mounts around on nobelium, mounted bigger disk on /var/lib/elasticsearch [17:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:32] ebernhardson: ok, the remount seems to have gone ok [17:59:37] sweet [17:59:43] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch inactive shards 1525 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 1522, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 262, cluster_name: labsearch, relocating_shards: 0, active_shards: 262, initializing_shards: 3, number_of_data_nodes: 1, delayed_unassigne [17:59:52] ^ safe to ignore [18:00:00] ok [18:00:30] 6operations, 6Discovery, 6Labs, 7Elasticsearch: Mount /dev/md1 on nobelium to /var/lib/elasticsearch - https://phabricator.wikimedia.org/T114856#1748806 (10yuvipanda) [18:00:44] much better :) [nobelium] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/md1)]], net usable_space [5.3tb], net total_space [5.3tb], types [xfs [18:00:56] \o/ [18:00:58] cool [18:01:03] ebernhardson: ok, looking at those two roles now [18:02:30] (03PS1) 10Dzahn: admin: create group aqs-restbase-deployers [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) [18:03:32] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [18:03:42] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 6.206 second response time [18:05:59] (03PS1) 10Yuvipanda: elasticsearch: Setup partial MW install on nobelium [puppet] - 10https://gerrit.wikimedia.org/r/248381 [18:06:08] ebernhardson: ^ [18:06:43] (03CR) 10EBernhardson: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/248381 (owner: 10Yuvipanda) [18:07:24] 
(03PS2) 10Yuvipanda: elasticsearch: Setup partial MW install on nobelium [puppet] - 10https://gerrit.wikimedia.org/r/248381 [18:07:36] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Setup partial MW install on nobelium [puppet] - 10https://gerrit.wikimedia.org/r/248381 (owner: 10Yuvipanda) [18:08:11] ebernhardson: running puppet now [18:08:20] 6operations, 6Security, 5Patch-For-Review, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1748845 (10Tnegrin) Thanks Krenair - qchris has my complete trust and support for any access he feels is appropriate. [18:09:13] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:15] (03PS12) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:09:23] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: puppet fail [18:13:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1748867 (10Dzahn) @milimetric no, i'm not familiar with ansible deployment. the command line is very helpful. see my gerrit change above, that would be a... [18:16:14] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.022 second response time [18:16:39] !log bounce grafana-server on krypton [18:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:52] (03CR) 10Yuvipanda: "Does this mean we're officially using ansible in production?" [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) (owner: 10Dzahn) [18:18:40] (03CR) 10Dzahn: "i'm not sure. the ticket sounds like we are already doing so, just not formalized with an admin group, so root is needed for it" [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) (owner: 10Dzahn) [18:22:04] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:56] ebernhardson: well scap seems to keep failing [18:23:05] YuviPanda: :S whats the eror? [18:23:19] lol? Notice: /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki]/returns: Scap should not be run as root [18:23:29] :) [18:23:39] doing a third puppetrun now [18:23:41] let's see.... [18:23:52] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 2.498 second response time [18:24:01] nope [18:24:18] bd808: thcipriani do you know if the 'scap should not be run as root' is new? [18:24:35] possibly [18:24:42] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1748907 (10JohnLewis) 5Open>3Resolved I have: * rename education-collab to education-collab-private * delete education-collab and its archives * created education-col... 
[18:24:46] mutante: ^^ [18:24:51] it should really run as mwdeploy I think [18:25:03] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:30] YuviPanda: yea, the patch si from ot 8 [18:25:32] oct [18:25:51] right, so that breaks new mw installs I guess [18:25:53] bd808: true [18:26:35] YuviPanda: https://phabricator.wikimedia.org/D13 [18:26:43] JohnFLewis: :) woot, looks perfect [18:26:45] robh: I stole the whole education coop/collab ticket from you and it's resolved now. just a fyi :) [18:26:45] YuviPanda: looks like modules/mediawiki/manifests/scap.pp should be changed to run as another user [18:26:47] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748922 (10JanZerebecki) As long as a separate public suppression event exists that refers to the old one it sounds fine. [18:27:02] bd808: yeah am making a patch [18:27:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1748928 (10fgiunchedi) I'll add a bit of context here, ansible can't be "locked down" or restricted in its actions with sudo because it executes a custom p... [18:27:51] JohnFLewis: cool, i didnt notice i have so many tasks =] [18:27:54] thx! [18:28:17] (03CR) 10Filippo Giunchedi: [C: 04-2] "see ticket" [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) (owner: 10Dzahn) [18:29:06] (03PS1) 10Yuvipanda: mw: Make initial scap run as mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/248398 [18:29:08] bd808: ^ [18:29:32] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:24] (03CR) 10BryanDavis: [C: 031] "Fixes breakage for sync-common created by https://phabricator.wikimedia.org/D13" [puppet] - 10https://gerrit.wikimedia.org/r/248398 (owner: 10Yuvipanda) [18:30:47] (03PS2) 10Yuvipanda: mw: Make initial scap run as mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/248398 [18:30:48] bd808: thanks! [18:30:55] (03CR) 10Yuvipanda: [C: 032 V: 032] mw: Make initial scap run as mwdeploy user [puppet] - 10https://gerrit.wikimedia.org/r/248398 (owner: 10Yuvipanda) [18:32:51] Notice: /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki]/returns: @ERROR: access denied to common from nobelium.eqiad.wmnet (10.64.37.14) [18:32:53] Notice: /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki]/returns: rsync error: error starting client-server protocol (code 5) at main.c(1653) [Receiver=3.1.0] [18:32:55] now [18:32:57] bd808: ^ [18:33:04] what's 'common'? [18:33:11] is that rsync ip 'allow' issue? [18:33:18] that's the rsycn export on tin [18:33:28] (03PS13) 10Madhuvishy: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:33:30] access is from ::networks or something like that [18:33:31] ah, is that in network.pp or something? [18:34:12] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.025 second response time [18:35:12] (03PS1) 10Yuvipanda: scap: Allow nobelium to run scap [puppet] - 10https://gerrit.wikimedia.org/r/248399 [18:35:14] bd808: ^ [18:35:57] YuviPanda: does it need ipv6 too? 
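On the access-denied error above: that is the rsync daemon on the deploy master rejecting the new host, so once the ACL patch is merged and puppet has run on tin, a quick sanity check from nobelium looks roughly like this (the tin FQDN is an assumption here; the module name "common" comes straight from the error message):

    # can nobelium list the scap rsync module at all now?
    rsync --list-only rsync://tin.eqiad.wmnet/common/ | head
    # then re-run puppet so the fetch_mediawiki Exec retries the initial sync,
    # which after the earlier fix runs as mwdeploy rather than root
    sudo puppet agent --test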
[18:36:22] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:36:31] bd808: good catch, I'm not sure but might as wlel.. [18:37:12] (03PS2) 10Yuvipanda: scap: Allow nobelium to run scap [puppet] - 10https://gerrit.wikimedia.org/r/248399 [18:37:18] bd808: added! [18:37:52] PROBLEM - mediawiki-installation DSH group on nobelium is CRITICAL: Host nobelium is not in mediawiki-installation dsh group [18:38:08] (03CR) 10BryanDavis: [C: 031] "Will need a forced puppet run on tin to change rsync server acls." [puppet] - 10https://gerrit.wikimedia.org/r/248399 (owner: 10Yuvipanda) [18:38:27] (03PS3) 10Yuvipanda: scap: Allow nobelium to run scap [puppet] - 10https://gerrit.wikimedia.org/r/248399 [18:38:34] (03CR) 10Yuvipanda: [C: 032 V: 032] scap: Allow nobelium to run scap [puppet] - 10https://gerrit.wikimedia.org/r/248399 (owner: 10Yuvipanda) [18:40:09] So I can't log into this host as my own user, but I can do so as mwdeploy? [18:40:18] interesting.. [18:40:52] robh: did you already create all the rack/cable/install tickets for the labs test cluster? If not, should I? [18:41:31] Krenair: the host isn't actually a mediawiki machine, so there will be some oddities [18:41:40] andrewbogott: you mean the ones waiting on the network tasks? [18:41:47] i cannot create the onsite tasks for where to move them until thats done [18:41:56] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1748998 (10Eevans) >>! In T116247#1748095, @Ottomata wrote: >> So the producer would store the same time stamp twice? UUID v1 already contains it. >... [18:41:59] since we dont know where they'll go. [18:42:02] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:42:07] robh: ok, got it. [18:42:14] i thoguht we had this discussion yesterday? ;] [18:42:15] robh: where is that at? [18:42:21] phab wise [18:42:29] https://phabricator.wikimedia.org/T114435 [18:42:38] that’s the, um, epic [18:42:44] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1748999 (10RobH) [18:42:50] https://phabricator.wikimedia.org/T114435 [18:42:54] has the blocking network tickets assgined to mark [18:43:22] (03CR) 10Dzahn: "cant run in compiler, because:" [puppet] - 10https://gerrit.wikimedia.org/r/247217 (owner: 10Muehlenhoff) [18:43:24] robh: Yeah, now that you say that of course we can’t do anything else until we know about subnets :) [18:43:26] 6operations, 6Security, 5Patch-For-Review, 7Security-General: determine validity of Christian Aistleitner (qchris's) shell account - https://phabricator.wikimedia.org/T104254#1749001 (10Tnegrin) [continued] so I would support keeping the status quo for now. [18:43:30] ok thanks [18:43:40] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1749002 (10RobH) Note I cannot create the onsite allocation of these sysetms until the network is setup for labs in codfw; due to the fact I'm not sure... [18:43:45] that killed rsync [18:43:45] i've appended a comment to the end to clarify [18:44:02] * YuviPanda investigates [18:44:15] andrewbogott / chasemp: the boxes are allocated and approved i jsut need the netwokring done to know where they need to be. 
[18:44:24] sure gotcha [18:44:26] running down the block to swap laundry, back in 5m =] [18:45:12] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:45:53] (03PS1) 10Andrew Bogott: Make labvirt1010 a compute node. [puppet] - 10https://gerrit.wikimedia.org/r/248402 [18:46:49] (03PS2) 10Andrew Bogott: Make labvirt1010 a compute node. [puppet] - 10https://gerrit.wikimedia.org/r/248402 [18:48:41] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.017 second response time [18:51:37] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1749023 (10ellery) Special:BanenrLoader currently does not include the campaign as a query parameter. Special:Re... [18:53:23] (03CR) 10Andrew Bogott: [C: 032] Make labvirt1010 a compute node. [puppet] - 10https://gerrit.wikimedia.org/r/248402 (owner: 10Andrew Bogott) [18:53:39] heh I had forgotten how long the initial scap takes [18:54:32] do we have docs on puppet-compiler, mutante? [18:54:49] YuviPanda: sounds like its working at least? woo! [18:54:54] ebernhardson: yeah [18:55:04] ebernhardson: well puppet's waiting for a while, so I suppose it is! [18:55:07] :D [18:55:09] heh [18:56:15] you use https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/build to trigger it? [18:57:02] Krenair: yeah [18:57:59] I should really send some stuff through puppet swat next week [18:58:16] my gerrit list is getting a bit big again [19:00:11] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [19:07:39] !log starting nodetool cleanup on restbase-test2001-a [19:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:31] !log starting nodetool cleanup on restbase-test2001-b [19:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:15] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:11:21] ebernhardson: done!! 
[19:13:27] YuviPanda: no mwscript command :S [19:13:41] i thought the maintenance role was deploying that...hmm [19:14:49] ebernhardson: it should've been [19:15:07] ah no [19:15:09] scap::scripts [19:15:39] That's what terbium uses [19:16:06] tin gets it through role::deployment -> scap::master [19:17:22] (03PS1) 10Yuvipanda: es: Remove maint scripts and add scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/248407 [19:17:25] (03CR) 10jenkins-bot: [V: 04-1] es: Remove maint scripts and add scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/248407 (owner: 10Yuvipanda) [19:17:39] (03PS2) 10Yuvipanda: es: Remove maint scripts and add scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/248407 [19:17:54] ebernhardson: lol, mw::maintenance setup all the cronjobs [19:17:59] I've manually killed them all now [19:18:07] (03CR) 10Yuvipanda: [C: 032 V: 032] es: Remove maint scripts and add scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/248407 (owner: 10Yuvipanda) [19:18:56] killed as in killed from crontab [19:20:04] I don't think any of those would've worked without mwscript and friends anyway [19:20:38] ebernhardson: mwscript exists now [19:21:56] PROBLEM - DPKG on labvirt1010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:23:28] YuviPanda: looks to be proper, something is still odd (starting a copy attempts to reindex 0 documents)...will debug [19:23:42] ebernhardson: ok [19:24:20] ebernhardson: could be more firewalling although shouldn't be... [19:25:02] YuviPanda: i dont think thats it, i can talk to eqiad cluster via curl (and the error would be different) [19:25:13] ah ok [19:27:56] a rename is creating lag on enwiki [19:28:34] legoktm: ^ [19:28:46] uh [19:29:11] (03PS1) 10Andrew Bogott: Don't create an explicit nova partition on labvirt1010. [puppet] - 10https://gerrit.wikimedia.org/r/248409 [19:29:13] (03PS1) 10Andrew Bogott: Update the ubuntu cloud-archive apt key [puppet] - 10https://gerrit.wikimedia.org/r/248410 [19:30:19] (03CR) 10Andrew Bogott: [C: 032] Don't create an explicit nova partition on labvirt1010. [puppet] - 10https://gerrit.wikimedia.org/r/248409 (owner: 10Andrew Bogott) [19:30:31] (03CR) 10Andrew Bogott: [C: 032] Update the ubuntu cloud-archive apt key [puppet] - 10https://gerrit.wikimedia.org/r/248410 (owner: 10Andrew Bogott) [19:30:51] jynus: db1047? [19:31:03] no, everywhere [19:31:10] except on the most powerful servers [19:31:18] jynus: what's the query? [19:31:26] it is a rename [19:31:30] not a single query [19:31:39] I wonder why that is not more controlled [19:31:49] as in, slower [19:32:50] https://phabricator.wikimedia.org/T116425 [19:32:58] we lock the user out of their account during the rename, so we try and minimize that time [19:33:06] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:33:26] well, up to 4 minutes of lag... [19:33:46] but they aren't supposed to rename users with more than 50k edits without a sysadmin watching [19:33:49] how is lag checked, is it checked on all servers? 
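For the lag question just above: MediaWiki's load balancer does its own lag sampling, but the raw per-replica view is the usual MariaDB one; a minimal sketch (the host name is taken from the conversation, credentials and socket details are omitted):

    mysql -h db2016 -e 'SHOW SLAVE STATUS\G' | \
        grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'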
[19:34:15] oh, I do not care much about this particular issue [19:34:40] I am just filing it as a TODO for the overeal "minimizing lag" task [19:35:35] https://phabricator.wikimedia.org/diffusion/EREN/browse/master/RenameUserJob.php [19:35:45] for example, there is 1 minute of lag on db2016, which is understandable because from eqiad point of view is not pooled [19:35:47] It doesn't wait in between queries [19:36:00] but it is something to solve for future work there [19:36:15] do you agree with me about the general idea? [19:36:32] I think so [19:37:02] "something to (slowly) fix, and at least for now identifiy all offenders" [19:37:55] yeah [19:39:01] I've subscrived you but feel free to unsubscribe [19:39:29] I already watch all Renameuser-related tasks :) [19:39:47] It is a good thing, then I saw the 1 minute lag it thought it was something worse :-) [19:39:56] *when [19:44:02] I think I own you some work about modifying the revision table, we will talk soon [19:46:03] if only we had used mongodb we wouldn't have problems iwth pesky schemas! [19:46:37] funny that you say that, because this is a problem of denomalization of the user name [19:47:09] if we had a bit of extra normalization in the username - user id, renameuser tasks wouldn't be needed [19:47:35] (I know you were joking) [19:47:57] :P [19:48:00] but in this case, mongodb would be even worse [19:48:06] unlike in other cases [19:48:16] my username is stuck in no-man's land because merging [19:48:20] YuviPanda and yuvipanda [19:48:23] no [19:48:26] legoktm: what exactly was the end status of that? [19:48:37] one day, when I have all the time in the world [19:48:49] YuviPanda: YuviPanda is ok to use, Yuvipanda is locked [19:49:15] legoktm: did all the contribs and stuff make it through? [19:49:16] I will propose a redisign, that I will myself block to apply [19:49:28] YuviPanda: not all [19:49:38] legoktm: will they ever move? :D [19:49:48] jynus: something like https://phabricator.wikimedia.org/T33863 ? [19:50:23] yes, I supposed I wasn't the first one [19:50:58] I think it is feasable, it will only hurt once [19:51:17] but the amount of computing time and pain that will save in the long term? [19:51:42] > Adding Yuvi to this bug since he said he'd take a look at this. 
[19:51:47] aww when I was still doing mw stuff [19:52:20] For ops that join now, the s1 lag is a real issue [19:52:31] but I think it is now under control [19:53:20] I've filed https://phabricator.wikimedia.org/T116425 for long term [19:55:48] ok, I'm going for lunch now, please ping me if I need to do anything [20:00:26] (03PS1) 10Andrew Bogott: Revert "Update the ubuntu cloud-archive apt key" [puppet] - 10https://gerrit.wikimedia.org/r/248474 [20:01:36] RECOVERY - DPKG on labvirt1010 is OK: All packages OK [20:04:35] !log rename user job had created lag on almost all enwiki dbs, things should be better now [20:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:31] (03PS1) 10Alex Monk: Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 [20:15:55] wmf-config/wgConfVHosts.php: // 'wikimedia.org' // Removed 2008-09-30 by brion -- breaks codereview-proxy.wikimedia.org [20:15:56] hm [20:16:19] (03PS1) 10Alex Monk: Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 [20:17:01] Heh [20:20:34] 6operations, 10OCG-General-or-Unknown: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1749320 (10Aklapper) >>! In T97524#1491476, @fgiunchedi wrote: > the warnings are back, @cscott are these actionable or we should be alerting or something else like failed jobs? ^ @cscot... [20:23:30] even with that commented out, it breaks it now ;) [20:27:59] YuviPanda: is the rsync to labstore from stat boxes active? [20:28:09] madhuvishy: should be. [20:28:24] madhuvishy: it's just an open port of sorts, I think. ottomata had some docs but I forgot where they are [20:28:29] YuviPanda: okay, asking because wiki page says its not [20:28:34] https://wikitech.wikimedia.org/wiki/Analytics/FAQ#How_do_I_transfer_public_files_from_stat_boxes_to_labs [20:28:51] oh [20:28:54] YuviPanda: hmmm, do i need access to labstore1003 to get it to work? [20:28:56] ottomata: didn't you get it to work? [20:28:58] madhuvishy: no [20:29:08] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1749332 (10Addshore) [20:29:11] YuviPanda: just the puppet change? [20:29:11] 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1749325 (10Addshore) [20:29:12] madhuvishy: just the commandline should work [20:29:35] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail [20:29:42] madhuvishy: what're you trying to do? [20:29:45] YuviPanda: but, it says if there's no /public/statistics directory you need to first change puppet [20:29:48] is this for the art-recs pipeline? [20:29:51] madhuvishy: aaaaah [20:29:55] madhuvishy: yes, right. in the labs instnace [20:29:57] i wanna get some pageview data from hadoop [20:29:59] madhuvishy: which project is this for? [20:30:06] into druid1.analytics [20:30:08] madhuvishy: that's to get access on the labs side [20:30:13] madhuvishy: so the analytics project? 
[20:30:16] yes [20:31:38] (03PS1) 10Yuvipanda: labs: Add statistics NFS mount to analytics project [puppet] - 10https://gerrit.wikimedia.org/r/248481 [20:31:40] madhuvishy: ^ [20:31:41] is the chang [20:32:02] (03CR) 10Yuvipanda: [C: 032 V: 032] "Requested by madhuvishy" [puppet] - 10https://gerrit.wikimedia.org/r/248481 (owner: 10Yuvipanda) [20:32:28] YuviPanda: thanks! [20:33:34] madhuvishy: do a puppet run on the instance and lmk if it works? [20:33:40] YuviPanda: doing [20:34:17] YuviPanda: cooool it showed up [20:34:32] madhuvishy: awesome [20:34:44] madhuvishy: can you edit docs about how to make it show up? :D [20:37:37] I shall slink away for food now [20:38:36] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:41:35] YuviPanda: the rsync itself does not seem to work [20:41:38] the directory showed up [20:43:07] madhuvishy: hmm, ottomata ottomata ottomataaaaaaaaaaaaaaaaaaaaaaaaaaaa [20:43:17] YuviPanda: lol [20:43:18] (he did all the rsync bits, I've never used them) [20:43:54] YuviPanda: okay :) i'll find out, thanks :) [20:44:28] madhuvishy: yw! and sorry... [20:44:33] anyway, foood. for real. [20:47:20] with you in 5 ins [20:47:21] mins [20:48:15] ottomata: not to worry, I figured it out, it works alright [20:48:36] YuviPanda: so wierd unexplained thing of the day...on nobelium if i curl elastic1001:9200/_status i get the expected result, but talking to search.svc.eqiad.wmnet i end up talking to localhost [20:48:46] 6operations, 10ops-ulsfo: Move NTT @ ulsfo to a different cross-connect - https://phabricator.wikimedia.org/T112154#1749399 (10RobH) There has been a large amount of out of band work on this. Currently the cross-connections are finally in place @ ULSFO. They were submitted at the start of the week, but were... [20:48:47] YuviPanda: any clue how that can be? some funny lvs thing? [20:50:35] YuviPanda: this about sums it up: https://phabricator.wikimedia.org/P2223 [20:53:36] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1749424 (10Jgreen) >>! In T97676#1749023, @ellery wrote: > Special:BanenrLoader currently does not include the ca... [20:53:36] (03PS1) 10RobH: updating procurement project direct domain emails [puppet] - 10https://gerrit.wikimedia.org/r/248484 [20:54:37] (03PS2) 10RobH: updating procurement project direct domain emails [puppet] - 10https://gerrit.wikimedia.org/r/248484 [20:54:43] (03CR) 10RobH: [C: 032] updating procurement project direct domain emails [puppet] - 10https://gerrit.wikimedia.org/r/248484 (owner: 10RobH) [21:01:49] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1749449 (10ellery) Hmm. I thought Special:RecordImpression was retired and replaced by Special:BannerLoader. Andy... [21:04:29] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1749452 (10Ottomata) Right, but how would you do this in say, Hive? Or in bash? Timestamp logic should be easy and immediate. > Regarding a separa... 
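One hedged guess for the "talking to localhost" puzzle above: hosts that are (or once were) LVS-DR realservers carry the service IP on their loopback interface, so traffic they send to the service name never leaves the box. Checking that from nobelium would look roughly like this (the diagnosis is speculation; only the hostnames come from the log):

    host search.svc.eqiad.wmnet          # what does the service name resolve to?
    ip addr show dev lo                  # is that same IP configured on loopback here?
    # compare against a direct query to a real cluster node
    curl -s elastic1001:9200/_status | head -c 200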
[21:05:42] (03CR) 10Andrew Bogott: [C: 032] Revert "Update the ubuntu cloud-archive apt key" [puppet] - 10https://gerrit.wikimedia.org/r/248474 (owner: 10Andrew Bogott) [21:06:20] (03PS2) 10Andrew Bogott: Revert "Update the ubuntu cloud-archive apt key" [puppet] - 10https://gerrit.wikimedia.org/r/248474 [21:12:45] PROBLEM - Host nobelium is DOWN: PING CRITICAL - Packet loss = 100% [21:14:17] (03Abandoned) 10Andrew Bogott: Revert "Update the ubuntu cloud-archive apt key" [puppet] - 10https://gerrit.wikimedia.org/r/248474 (owner: 10Andrew Bogott) [21:14:52] 6operations, 6Phabricator, 6Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1749466 (10RobH) Just for historical notes, I also had to change all the operations clinic dashboards, as the old queries... [21:19:35] YuviPanda: ^^ [21:19:58] YuviPanda: PROBLEM - Host nobelium is DOWN: PING CRITICAL - Packet loss = 100%. doesn't have to be fixed(not a prod service) now...but i'm not sure what happened [21:21:03] ugh [21:21:06] (I'm out eating) [21:21:17] mutante: can you take a look? powercycle maybe? [21:21:57] ok [21:22:33] how much "not a prod" :) [21:22:44] tries mgmt [21:24:03] !log nobelium - powercycled, no console output [21:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:30] mutante: its a test server that will be decom'd monday after next. Its testing making a copy of the elasticsearch cluster available in labs [21:24:38] basically, noone uses it for anything except me and yuvi [21:25:12] heavy disk usage shouldn't crash it, but the only graph that looks interesting is http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445635416.601&target=servers.nobelium.iostat.md1.iops&from=-1hours [21:25:29] ebernhardson: ok, thanks for the explanation. i believe it is coming back now [21:25:31] iops jumped up to >1000 (i never know what the units in graphite are...) for a bit and then crashed [21:25:32] i am watching it boot [21:25:34] thanks [21:25:58] nobelium login: [21:26:16] RECOVERY - Host nobelium is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [21:26:17] mutante: yup its back, thanks! [21:26:19] yea, just powercycled, there was no output on the serial console [21:26:23] np [21:27:16] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [21:28:57] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [21:29:32] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749500 (10Nemo_bis) There isn't any need to email Lars now. First the mailing lists need to be created on Gmane per the usual process, *then* the mbox files can be provided a... [21:30:20] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1749501 (10Krenair) We already have tin:/srv/mediawiki-staging/wmf-config/.svn... I'm not sure we can use it to get the full history etc. though? [21:31:57] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1749505 (10Reedy) I guess we ideally need /home/wikipedia/conf-svn/wmf-config for the actual svn repo... 
[21:31:59] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749506 (10JohnLewis) Even the instructions are unclear and seem fairly directed at '1 or 2' lists. When you get into the boundaries of near a hundred, just emailing gets comp... [21:32:10] 6operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1749508 (10aaron) [21:35:30] looks like it might have just overheated..went from "Package temperature above threshold, cpu clock throttled (total events = 1)" to "Package temperature above threshold, cpu clock throttled (total events = 87146)" in the span of 10 minutes, then went down [21:36:44] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749524 (10Nemo_bis) > When you get into the boundaries of near a hundred, just emailing gets complicated, length and annoying for everyone involved. I'm not sure what you're... [21:39:27] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749529 (10JohnLewis) I have actually just received an email from a Gmane administrator. They've provided me a list of all lists that are currently being monitored and clarity... [21:46:26] my uneducated guess..something is wrong with nobelium's cooling system. htop shows the load moves arround the processors, but one processors is ~70 degrees C, and the other is 94 degrees [21:46:42] err, one socket is 70C and the other socket is 94C [21:46:51] (with crit=94C, its throttling) [21:48:28] ebernhardson: back [21:48:30] ebernhardson: ugh, that sucks [21:48:39] ebernhardson: did you ever manage to load its CPUs at all? [21:48:46] ebernhardson: can you file a bug too? I'll try look at the lvs thing now [21:48:49] YuviPanda: it was sitting around 40% cpu usage [21:48:55] almost entirely from java [21:49:07] (turns out hhvm runs the import process with 10x less cpu than php 5.3.10) [21:49:12] haha [21:49:14] nice [21:49:21] ebernhardson: oh wait, so the import process is running already? [21:49:23] niceee [21:49:36] we should really switch terbium and tin over at some point [21:49:41] YuviPanda: yea i started up the import, but i've turned it back off since the machine doesn't seem happy [21:49:53] i could let it run and see if it crashes again though [21:49:56] ebernhardson: yeah [21:49:56] (03PS1) 10Dzahn: add '15' to list of project languages [dns] - 10https://gerrit.wikimedia.org/r/248504 [21:50:14] Reedy: ^ :o [21:50:25] switch tin and terbium over, YuviPanda? [21:50:30] oh, to hhvm? [21:50:32] Krenair: yeah [21:50:36] indeed [21:50:43] mutante: didn't that get done? [21:50:44] Krenair: since the eqiad -> codfw copy is on terbium [21:50:51] Reedy: no [21:50:52] uh [21:50:55] denied [21:50:58] (03CR) 10Alex Monk: "I don't think this is necessary..." 
[dns] - 10https://gerrit.wikimedia.org/r/248504 (owner: 10Dzahn) [21:51:21] Reedy: https://phabricator.wikimedia.org/T599#1749450 last 5 comments [21:51:43] For the other misc sites like this we add entries to templates/wikipedia.org [21:51:51] alex@alex-laptop:~/Development/Wikimedia/Operations-DNS (master)$ grep ten templates/wikipedia.org [21:51:51] ten 600 IN DYNA geoip!text-addrs [21:51:51] ten.m 600 IN DYNA geoip!mobile-addrs [21:51:54] (03CR) 10Dzahn: "Alex Monk: please add on https://phabricator.wikimedia.org/T599#1749450" [dns] - 10https://gerrit.wikimedia.org/r/248504 (owner: 10Dzahn) [21:53:26] (03PS1) 10MaxSem: [WIP] Switch www.wikipedia.org to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) [21:53:33] 6operations: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749580 (10EBernhardson) 3NEW [21:53:45] (03CR) 10MaxSem: [C: 04-2] [WIP] Switch www.wikipedia.org to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [21:54:01] 6operations: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749587 (10EBernhardson) [21:56:36] YuviPanda, what? [21:56:44] the elasticsearch copy is on terbium? [21:56:50] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749597 (10JohnLewis) See P2224 which is a full list of all lists that don't exist in Gmane but should. I'll have these made in Gmane and then put the mboxes behind a passwor... [21:57:03] *That's* the reason you want it switched? [21:57:33] 6operations: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749600 (10EBernhardson) [21:58:30] Krenair: no, because often enough something cpu heavy is 10x less resource intensive on hvm [21:58:44] Krenair: this particular copy is just another thing that runs into it [22:00:57] robh: cmjohnson any idea how we can investigate https://phabricator.wikimedia.org/T116439 [22:01:36] well, first step is chekcing the thermal paste it seems they dry out over time [22:01:44] so i'll append the onsite tag and chris can check it out [22:01:57] iirc that was a problem with a shit ton of apaches in the recent past. [22:02:06] he had to go on a re-pasting spree [22:03:13] fun :) [22:03:15] robh: thanks! [22:03:34] 6operations, 10ops-eqiad: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749623 (10RobH) I vaguely recall a rash of overheating cpus recently, in which @cmjohnson had to re-apply thermal paste to a number of systems. [22:04:24] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1749626 (10GWicke) > Right, but how would you do this in say, Hive? Or in bash? Timestamp logic should be easy and immediate. Yeah, Hive really seem... [22:04:49] YuviPanda: if its not that its still under warranty [22:04:53] so thats good =] [22:05:15] robh: ah nice [22:05:26] but i bet its totally dried out thermal paste [22:05:31] but I hope we won't have to go that far since I'll have to do an approvals dance again (in 10d or so) [22:05:42] does sound very likley for that sort of uneven cooling [22:05:56] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [22:06:15] ebernhardson: hmm, so the import is not running now, just to confirm? 
[22:06:24] YuviPanda: right, i stopped it. want me to start it up?
[22:06:31] ebernhardson: yeah
[22:06:43] ebernhardson: https://xkcd.com/242/ and all that
[22:07:06] YuviPanda: started
[22:07:43] (it takes awhile to heat everything up)
[22:09:39] let's see if it dies again
[22:09:42] ok
[22:14:19] (03PS2) 10Dzahn: add 15.wikipedia.org and 15.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599)
[22:14:57] YuviPanda: i think it might have just given up the ghost, no more ping response
[22:15:12] yup
[22:15:14] looks like it
[22:15:45] I suppose this basically means we can't really do a copy until there is a physical inspection + possible fix
[22:15:53] yea
[22:15:58] :(
[22:16:29] but i think i can run the codfw copy from here at least :)
[22:16:53] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1749660 (10GWicke)
[22:17:15] ebernhardson: heh
[22:17:25] PROBLEM - Host nobelium is DOWN: PING CRITICAL - Packet loss = 100%
[22:18:00] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (10GWicke) I went ahead and updated the task description with the current framing / per-event schema. I renamed the `reqid` to just `id`, an...
[22:21:47] robh: mutante ugh, my gpg setup is having issues. can you powercycle nobelium again?
[22:21:49] (sorry)
[22:22:05] ya
[22:22:29] YuviPanda: hrmm
[22:23:12] it wont console via mgmt
[22:23:12] is someone else on it still?
[22:23:12] ouch
[22:23:12] robh: mutante was on it last
[22:23:12] mutante: you still console on nobelium?
[22:23:12] YuviPanda: im trying to see if the os is locked before i blind reboot.
[22:23:12] ok
[22:23:14] 6operations, 10ops-eqiad: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749682 (10yuvipanda) Is consistently reproducible. Tax CPU, host goes down.
[22:23:15] it doesn't respond to ping tho
[22:23:16] robh: yes, quitting now
[22:23:22] done
[22:23:24] yea, its hard locked
[22:23:29] powercycling
[22:23:50] !log powercycled nobelium
[22:23:52] I guess cmjohnson isn't in the DC anymore, it's late on a friday.
[22:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:24:12] YuviPanda: its evening there so i would imagine he is not
[22:24:29] yeah
[22:24:29] oh yeah, TZ as well
[22:25:38] ebernhardson: I wonder if we can 'ban' a particular cpu
[22:25:50] but I guess we'll just have to wait till monday.
[22:25:51] YuviPanda: hmm, not sure
[22:26:06] RECOVERY - Host nobelium is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[22:26:35] we can do that but it's friday eveningish so I guess let's not.
[22:29:12] YuviPanda: you can set the CPUFreq governor to powersave
[22:29:18] we set it to "performance" everywhere
[22:29:55] just add class { 'cpufrequtils': governor => 'powersave' }
[22:30:12] you could probably also scale cpu frequency down in the bios via mgmt
[22:31:17] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[22:32:51] ori: true, and I think you can also disable a core from being used at all
[22:33:17] but I don't want to try any of them now, in an effort to sustain more 'do not work until 9pm everyday dammit!'
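For reference, a rough sketch of what ori's and YuviPanda's two ideas would look like if run on the host itself, as root. This assumes cpufrequtils is installed and, per the earlier readings, that the hot socket is physical package 1; neither command was actually run in this log, and cpu0 usually cannot be taken offline:

    # switch every core to the powersave governor (the puppet equivalent being the
    # cpufrequtils class with governor => 'powersave', as quoted above)
    for c in /sys/devices/system/cpu/cpu[0-9]*; do
        cpufreq-set -c "${c##*cpu}" -g powersave
    done

    # or 'ban' the cores of the overheating package by taking them offline
    for c in /sys/devices/system/cpu/cpu[0-9]*; do
        if [ "$(cat "$c/topology/physical_package_id")" = "1" ]; then
            echo 0 > "$c/online"
        fi
    done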
[22:34:26] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[22:36:31] 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1749735 (10Dzahn) What are you trying to find out?
[22:37:26] ori: is there a reason why we aren't using 'ondemand'?
[22:37:57] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1749739 (10Dzahn) duplicate of T116429
[22:38:26] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[22:38:59] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1749749 (10Dzahn)
[22:44:05] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1749757 (10AndyRussG) @ellery, sadly we haven't yet been able to ditch Special:RecordImpression for Special:Banne...
[22:50:23] 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1749766 (10Dzahn) p:5Triage>3Normal
[22:51:18] 6operations, 10ops-eqiad: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1749768 (10Dzahn) p:5Triage>3Normal normal priority per "not a prod system"
[23:08:32] (03PS1) 10Ori.livneh: Update flamegraph.pl to brendangregg/Flamegraph@182b24f [puppet] - 10https://gerrit.wikimedia.org/r/248561
[23:08:46] (03PS2) 10Ori.livneh: Update flamegraph.pl to brendangregg/Flamegraph@182b24f [puppet] - 10https://gerrit.wikimedia.org/r/248561
[23:08:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Update flamegraph.pl to brendangregg/Flamegraph@182b24f [puppet] - 10https://gerrit.wikimedia.org/r/248561 (owner: 10Ori.livneh)
[23:09:04] !log Started a local rename for Stefan4 to Stefan2 on commons, per request (and attributed to) DerHexer
[23:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:18:32] (03CR) 10MaxSem: [C: 04-1] "Will add my comments to the bug shortly." [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn)
[23:31:35] !log mwscript deleteEqualMessages.php --wiki ukwiki
[23:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:31:52] !log mwscript deleteEqualMessages.php --wiki hewikisource
[23:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:32:02] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1749836 (10RobH) I'm interested to know if there are any considerable downsides to setting all public lists to also have publicly accessible mbox files? This would prevent fu...
[23:38:52] mutante, is there a way to get an interactive ruby console with the same sort of context as is set up for the puppet templates?
[23:39:19] with the ability to use scope.lookupvar etc.
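Krenair's question goes unanswered in this stretch of the log. One common workaround, offered here only as a hedged sketch rather than whatever answer mutante may have given, is to evaluate a throwaway ERB snippet through inline_template(), which runs with the same scope object real templates get (Puppet 3-era syntax):

    # not a true REPL, but prints whatever the template scope resolves
    sudo puppet apply -e 'notice(inline_template("<%= scope.lookupvar(\"::fqdn\") %>"))'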