[01:45:32] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 34.62% of data above the critical threshold [100000000.0]
[02:27:10] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 14s)
[02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:47] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-26 02:31:46+00:00
[02:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:32] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[02:56:12] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: puppet fail
[03:22:32] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:43:11] PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:09:31] RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[05:48:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 26 05:48:19 UTC 2015 (duration 48m 18s)
[05:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:29:52] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:42] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:51] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:11] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:12] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:31] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:35] <_joe_> oh, DST
[06:40:40] <_joe_> I was worrying :P
[06:55:12] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:55:22] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:55:41] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:55:42] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:56:02] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:02] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:11] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:31] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:35:49] (03CR) 10Giuseppe Lavagetto: [C: 032] Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto)
[07:36:24] (03Merged) 10jenkins-bot: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto)
[07:45:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto)
[07:45:42] (03Merged) 10jenkins-bot: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto)
[07:58:15] (03PS2) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221
[08:00:13] (03PS3) 10Muehlenhoff: Assign salt grains for kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/248329
[08:01:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/248329 (owner: 10Muehlenhoff)
[08:05:12] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:06:50] (03PS2) 10Muehlenhoff: Assign salt grains for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/248330
[08:07:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/248330 (owner: 10Muehlenhoff)
[08:09:57] (03PS7) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:10:26] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:15:15] (03CR) 10ArielGlenn: "Do you have an estimate of how long it will take this to run on the larger (wikidata, commons, en wiki) wikis?" [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson)
[08:17:34] (03PS2) 10Muehlenhoff: Assign salt grains for labvirt/nova compute [puppet] - 10https://gerrit.wikimedia.org/r/248331
[08:17:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labvirt/nova compute [puppet] - 10https://gerrit.wikimedia.org/r/248331 (owner: 10Muehlenhoff)
[08:19:37] (03PS2) 10Muehlenhoff: Assign salt grains for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/248332
[08:20:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/248332 (owner: 10Muehlenhoff)
[08:20:42] (03PS2) 10Muehlenhoff: Assign salt grains for terbium [puppet] - 10https://gerrit.wikimedia.org/r/248333
[08:20:51] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
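Editor's note: the long run of "Assign salt grains" patches above tags groups of hosts so they can be addressed as a unit from the salt master. The patches themselves manage the grains through Puppet; the sketch below shows the same idea with the plain Salt CLI for illustration only, and the grain name and value are made up, not the ones in these changes:

```bash
# On a minion: record a grain (a static key/value fact) locally.
salt-call grains.setval service_group jobrunner

# On the master: target every minion carrying that grain.
salt -G 'service_group:jobrunner' test.ping
salt -G 'service_group:jobrunner' cmd.run 'uptime'
```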
[08:22:59] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for terbium [puppet] - 10https://gerrit.wikimedia.org/r/248333 (owner: 10Muehlenhoff)
[08:28:14] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752416 (10akosiaris) The reason is almost certainly this: https://phabricator.wikimedia.org/P2231 Dropping and recreating the PRIMARY KEY in a 100...
[08:28:37] (03PS1) 10ArielGlenn: dumps: one more conf file not updated for new path of dblists [puppet] - 10https://gerrit.wikimedia.org/r/248822
[08:29:39] (03CR) 10ArielGlenn: [C: 032] dumps: one more conf file not updated for new path of dblists [puppet] - 10https://gerrit.wikimedia.org/r/248822 (owner: 10ArielGlenn)
[08:29:41] (03PS2) 10Muehlenhoff: Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334
[08:30:23] (03PS3) 10Muehlenhoff: Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334
[08:30:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334 (owner: 10Muehlenhoff)
[08:44:03] (03PS8) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:44:30] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:53:48] (03PS2) 10Muehlenhoff: Assign salt grains for hue [puppet] - 10https://gerrit.wikimedia.org/r/248335
[08:55:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for hue [puppet] - 10https://gerrit.wikimedia.org/r/248335 (owner: 10Muehlenhoff)
[08:58:28] (03PS2) 10Muehlenhoff: Assign salt grains for ci [puppet] - 10https://gerrit.wikimedia.org/r/248336
[09:00:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for ci [puppet] - 10https://gerrit.wikimedia.org/r/248336 (owner: 10Muehlenhoff)
[09:04:34] (03PS2) 10Muehlenhoff: Assign salt grains for db analytics/sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/248337
[09:05:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for db analytics/sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/248337 (owner: 10Muehlenhoff)
[09:06:12] (03PS1) 10ArielGlenn: dumps: update listing of files for rsync to latest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824
[09:06:49] (03PS2) 10Muehlenhoff: Assign salt grains for pool counters [puppet] - 10https://gerrit.wikimedia.org/r/248338
[09:07:11] (03PS2) 10ArielGlenn: dumps: update listing of files for rsync to latest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824
[09:07:25] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for pool counters [puppet] - 10https://gerrit.wikimedia.org/r/248338 (owner: 10Muehlenhoff)
[09:08:12] (03PS3) 10ArielGlenn: dumps: update listing of files for rsync to latest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824
[09:08:20] (03PS3) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221
[09:09:01] (03CR) 10ArielGlenn: [C: 032] dumps: update listing of files for rsync to latest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824 (owner: 10ArielGlenn)
[09:09:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 (owner: 10Muehlenhoff)
[09:09:36] (03PS4) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221
[09:09:44] (03CR) 10Muehlenhoff: [V: 032] Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 (owner: 10Muehlenhoff)
[09:14:06] (03PS2) 10Muehlenhoff: Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974
[09:14:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974 (owner: 10Muehlenhoff)
[09:16:49] !log rebooting and installing jessie on db2060-db2070
[09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:30] <_joe_> !log restarting etcd on conf1001
[09:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:26] <_joe_> uhm the etcd cluster is in bad shape, meh
[09:30:10] (03CR) 10ArielGlenn: [C: 032] dumps: don't escape commands not run in shell [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244799 (owner: 10ArielGlenn)
[09:30:39] <_joe_> ok etcd cluster is ok again
[09:30:57] (03CR) 10ArielGlenn: [C: 032] dumps: unfix a camelcase, imported module not fixed up yet [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244800 (owner: 10ArielGlenn)
[09:31:06] (03CR) 10ArielGlenn: [V: 032] dumps: unfix a camelcase, imported module not fixed up yet [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244800 (owner: 10ArielGlenn)
[09:32:13] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix another indentation screwup from the pylint [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244801 (owner: 10ArielGlenn)
[09:36:13] apergos: ping re: https://phabricator.wikimedia.org/T87036
[09:36:32] apergos: sorry, that should have been https://phabricator.wikimedia.org/T94277
[09:37:24] ori: need to have a backport of the fix referred to in https://phabricator.wikimedia.org/T113932
[09:37:37] once those packages are available I can do a test run and then convert over to hhvm
[09:37:48] they are all running trusty already of course
[09:38:56] apergos: bd808 submitted a backport on sept 30: https://gerrit.wikimedia.org/r/#/c/242773/
[09:39:00] yes.
[09:39:22] I mean packages built, I can test them if someone can make them available to me
[09:39:53] (03CR) 10Jcrespo: [C: 031] Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 (owner: 10Muehlenhoff)
[09:39:57] ori
[09:40:02] i'll build it right now
[09:40:10] you will? great!
[09:40:35] if it works I won't be able to cut over the hosts til the next run (Nov 1) but that's coming up very soon.
[09:40:39] yes. i don't think it's unreasonable to expect ops to do it, though -- it has been traditionally done by ops.
[09:40:52] joe has done those I believe
[09:41:56] I don't expect we're going to gain anything from moving these long-running maintenance scripts to hhvm, but I understand the desire to have everything running off of one implementation
[09:41:57] joe is doing a bajillion other things, though, and he is not directly responsible for the snapshot migration. i really have to ask you to show ownership here.
[09:42:15] !log deleting unused elasticsearch indices in eqiad (T112863)
[09:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:42:22] he isn't, and you can ask, but I looked at these packages when I first investigated, and I really have no idea about building them
[09:42:44] the way it looks from my end is that any blocker is the occasion for all progress stalling completely until i nag
[09:42:46] and I did spend a chunk of time poking around in the repos
[09:43:12] <_joe_> apergos: ok lemme update what is on wikitech on building HHVM packages
[09:43:13] there are packaging instructions at https://wikitech.wikimedia.org/wiki/HHVM
[09:43:19] <_joe_> it's super easy now with copper
[09:43:33] <_joe_> oh it's already there, see
[09:43:50] "I don't expect we're going to gain anything from moving these long-running maintenance scripts to hhvm"
[09:43:54] not sure how you reached that conclusion
[09:44:40] speed-wise, what would we gain?
[09:45:21] I don't know what the workload of the snapshot hosts is, and whether they are primarily IO or CPU bound, but a speedup of x2 is very plausible
[09:45:32] the issue with HHVM and CLI invocations is that CLI invocations tend to be short-running, and so HHVM's JIT doesn't have time to pay off
[09:45:51] basically by the time the script terminates HHVM has not even finished analyzing the code, let alone optimizing it
[09:45:59] but this has no bearing on long-running jobs
[09:46:05] which benefit tremendously from HHVM
[09:46:20] <_joe_> ori: well if the bytecode gets loaded once I don't think there should be a significant gain, or am I mistaken?
[09:46:27] that is what I had thought
[09:46:43] (03PS2) 10Alex Monk: beta: Add enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248639
[09:47:04] no, it can recompile in the context of a single invocation
[09:47:45] <_joe_> anyways, off to take my meds
[09:47:49] good luck
[09:48:26] take care
[09:49:07] at any rate, privately reaching the conclusion that this task is not worth doing (or not worth prioritizing) is not great, because it doesn't really give anyone a chance to challenge the conclusion and explain
[09:49:16] no. I didn't reach that conclusion.
[09:49:39] as I said above, standardizing on one implementation is a perfectly valid reason
[09:49:57] and even if I had not thought so, it's still a task in my queue, and I still want to get it off of my queue
[09:50:44] you don't seem to hear me when I say that I cloned the repo(s) and tried looking at the package structure at the beginning and was overwhelmed
[09:51:14] i believe you, but you didn't communicate that on the task, or ask for help
[09:51:17] now maybe I should have asked joe or bd808 to please make packages available, as joe had volunteered to do in the past. that is my bad
[09:54:46] i don't mean to chew you out. i just need help. the mental cost of rereading the tasks to reconstruct some notion of where things are and what they need to move forward makes me miserable and unproductive.
[09:56:06] * ori goes a-packagin'.
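Editor's note: the JIT trade-off ori describes above (short CLI runs never amortize compilation, long jobs do) is easy to sanity-check. A minimal sketch, assuming a host with both interpreters installed; the maintenance script, wiki name, and paths are illustrative, not the actual snapshot workload:

```bash
cd /srv/mediawiki/php-1.27.0-wmf.3

# Zend PHP 5.x baseline: interpreted for the whole run.
time php maintenance/dumpBackup.php --wiki=testwiki --full > /dev/null

# HHVM: the JIT needs warm-up, so a short script sees little benefit,
# but a multi-hour dump job amortizes the compilation cost many times over.
time hhvm -v Eval.Jit=1 maintenance/dumpBackup.php --wiki=testwiki --full > /dev/null
```

On a job that runs for hours, the warm-up cost disappears into the noise, which is exactly ori's point about long-running jobs.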
[09:58:15] <_joe_> ori: I plan on converting tin as soon as possible, FYI
[10:08:55] (03PS2) 10Ori.livneh: Backport of D2486378: Implement compress.bzip2:// stream wrapper [debs/hhvm] - 10https://gerrit.wikimedia.org/r/242773 (https://phabricator.wikimedia.org/T113932) (owner: 10BryanDavis)
[10:09:04] (03CR) 10Ori.livneh: [C: 032 V: 032] Backport of D2486378: Implement compress.bzip2:// stream wrapper [debs/hhvm] - 10https://gerrit.wikimedia.org/r/242773 (https://phabricator.wikimedia.org/T113932) (owner: 10BryanDavis)
[10:12:21] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail
[10:13:00] <_joe_> what is missing from wikitech is the policy for package release to production
[10:13:07] <_joe_> (re HHVM)
[10:22:53] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: puppet fail
[10:39:12] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:45:30] (03PS1) 10Muehlenhoff: Assign salt grains for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/248842
[10:45:32] (03PS1) 10Muehlenhoff: Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843
[10:45:34] (03PS1) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844
[10:45:36] (03PS1) 10Muehlenhoff: Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845
[10:45:38] (03PS1) 10Muehlenhoff: Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846
[10:45:40] (03PS1) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847
[10:45:42] (03PS1) 10Muehlenhoff: Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848
[10:49:15] (03CR) 10Filippo Giunchedi: [C: 031] Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul)
[10:51:32] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:53:51] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:54:03] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:54:21] (03PS1) 10Alexandros Kosiaris: puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850
[10:55:11] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:06] (03Abandoned) 10Muehlenhoff: Make db2055 to db2070 as role spare [puppet] - 10https://gerrit.wikimedia.org/r/246823 (owner: 10Muehlenhoff)
[11:04:41] PROBLEM - Analytics Cassandra CQL query interface on aqs1001 is CRITICAL: Connection timed out
[11:06:21] RECOVERY - Analytics Cassandra CQL query interface on aqs1001 is OK: TCP OK - 0.006 second response time on port 9042
[11:10:02] PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused
[11:26:01] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[11:26:21] RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.021 second response time
[11:26:32] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:26:52] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[11:27:51] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[11:48:52] (03PS2) 10Muehlenhoff: Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843
[11:54:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843 (owner: 10Muehlenhoff)
[12:00:59] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752738 (10Selsharbaty-WMF) Hi @JohnLewis! Thank you for the quick response and getting this task done very quickly! I really appreciate your help. I just need some clar...
[12:01:02] (03PS2) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844
[12:03:50] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[12:05:32] (03CR) 10Alexandros Kosiaris: [C: 032] "http://puppet-compiler.wmflabs.org/1074/ says noop, apart from palladium which fails due to new_install not being in the labs/private repo" [puppet] - 10https://gerrit.wikimedia.org/r/248850 (owner: 10Alexandros Kosiaris)
[12:06:37] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752752 (10JohnLewis) Ah! I see, thanks for pointing this out to me. I copied the full configuration over from the private list to the public list to ensure all members ar...
[12:06:41] (03PS2) 10Alexandros Kosiaris: puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850
[12:07:05] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850 (owner: 10Alexandros Kosiaris)
[12:10:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 (owner: 10Muehlenhoff)
[12:10:49] (03PS3) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844
[12:10:56] (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 (owner: 10Muehlenhoff)
[12:11:33] (03PS2) 10Muehlenhoff: Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845
[12:15:19] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[12:15:29] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.109 second response time
[12:16:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845 (owner: 10Muehlenhoff)
[12:22:16] (03PS2) 10Muehlenhoff: Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846
[12:24:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846 (owner: 10Muehlenhoff)
[12:27:35] (03PS2) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847
[12:28:25] (03PS9) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:28:54] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:38:30] (03CR) 10BBlack: "@ottomata: is the new python code tested at all?" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[12:38:52] (03CR) 10BBlack: "(by that I mean really tested running it somewhere, as opposed to unit tests)" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[12:39:04] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752805 (10JohnLewis) This is now done, enjoy the correct list situation :)
[12:41:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:42:18] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752806 (10akosiaris) p:5Triage>3Unbreak!
[12:42:23] (03PS10) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:42:31] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:42:56] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:43:56] (03PS11) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:44:13] apergos: speaking of your queue, my snapshot patches are still pending unreviewed :)
[12:44:20] I know
[12:44:26] and they are on my today queue
[12:44:26] (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:44:33] well, two of them are reviewed by giuseppe
[12:44:34] * apergos flips over their old-fashioned paper notepad
[12:44:38] <_joe_> wtf is wrong with tox?
[12:44:56] yep, there they are, #5 for today. I am on #4 right now...
[12:46:15] (03PS12) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:47:07] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752808 (10akosiaris) 5Open>3Resolved a:3akosiaris After a full reinitialization of the slaves, replication is working once more. I see a coup...
[12:48:33] (03PS13) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:49:17] 6operations: monitor postgresql replication status - https://phabricator.wikimedia.org/T116580#1752811 (10akosiaris) 3NEW
[12:53:34] (03PS1) 10Merlijn van Deen: toollabs: install libsort-fields-perl [puppet] - 10https://gerrit.wikimedia.org/r/248861 (https://phabricator.wikimedia.org/T116579)
[12:59:28] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752834 (10Yurik) @akosiaris, sorry for the trouble. Just in case I break it again, could you write the sequence of steps that you did to recover it...
[12:59:47] (03PS2) 10coren: toollabs: install libsort-fields-perl [puppet] - 10https://gerrit.wikimedia.org/r/248861 (https://phabricator.wikimedia.org/T116579) (owner: 10Merlijn van Deen)
[13:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T1300).
[13:07:49] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:08:39] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:09:41] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:10:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:16:59] Coren: yes, sorry about that [13:17:19] moritzm: No worries, it happens to all of us now and then. :-) [13:19:01] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:00] Anyone on that restbase thing? [13:24:49] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [13:25:39] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [13:26:40] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [13:26:59] !log aude@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Add justMapping option to updateOneSearchIndexConfig script (duration: 00m 18s) [13:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:59] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [13:30:50] would help to updat ethe submodule... [13:32:39] !log aude@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Add justMapping option to updateOneSearchIndexConfig script (updated submodule) (duration: 00m 18s) [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:15] (03PS3) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 [13:37:44] (03CR) 10Ottomata: "Yes, had run it on cp1057 for a while, turned if off over the weekend. I just started it back up in a screen there. 
The test.reqstats.mi" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [13:37:46] (03PS4) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 [13:38:07] (03CR) 10Ottomata: "are being*" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [13:38:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 (owner: 10Muehlenhoff) [13:41:10] (03CR) 10Alex Monk: "See I0d1ba430" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [13:41:31] (03PS2) 10Muehlenhoff: Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848 [13:42:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848 (owner: 10Muehlenhoff) [13:46:00] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:47:11] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:47:38] <_joe_> akosiaris, morebots ^^ [13:47:43] <_joe_> err mobrovac [13:49:31] <_joe_> I see some cassandra troubles probably [13:49:49] <_joe_> servers being detected as up, then down immediately [13:49:59] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [13:50:26] <_joe_> WARN [MessagingService-Outgoing-/10.64.0.123] 2015-10-26 13:49:49,992 OutboundTcpConnection.java:414 - Seed gossip version is -2147483648; will not connect with that version [13:50:30] <_joe_> INFO [HANDSHAKE-/10.64.0.123] 2015-10-26 13:49:49,993 OutboundTcpConnection.java:494 - Cannot handshake version with /10.64.0.123 [13:53:48] (03PS1) 10Muehlenhoff: Add missing Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/248865 [13:54:40] _joe_: I was trying to find those logs. Can you tell me where they are for future ref? [13:54:59] _joe_: Also, 1001 is having heating issues; I was about to open a ticket for Chris. [13:55:55] hm [13:57:12] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1752956 (10coren) 3NEW a:3Cmjohnson [13:58:15] (03CR) 10Andrew Bogott: "What does the puppet compiler think about this one?" [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff) [13:59:14] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752966 (10akosiaris) Oh, it is basically reinitializing the slave. ``` stop postgres mv /srv/postgres/9.4/main/recovery.conf ~/ rm -rf /srv/postgr... 
[14:01:48] _joe_: hm, seems to have stabilised, getting all endpoints healthy now when running service_checker locally on aqs1002
[14:01:58] cass logs also say it's up
[14:03:01] uf, aqs1001 cass is marked as down
[14:03:04] (03CR) 10Muehlenhoff: [C: 04-1] "Same problem as with holmium, see http://puppet-compiler.wmflabs.org/1078/" [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff)
[14:04:00] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1752970 (10faidon) 5Resolved>3Open Let's keep the task open until we actually remove the static site as well.
[14:04:07] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1752972 (10faidon) p:5Normal>3Low
[14:04:16] (03PS2) 10Aude: Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837
[14:04:34] _joe_: not that I could help in this situation, but joal and I are the ones who should be looking after aqs
[14:04:53] I'll try to learn about the maintenance from mobrovac and take that over
[14:04:55] (03CR) 10Aude: [C: 032] Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 (owner: 10Aude)
[14:05:01] (03Merged) 10jenkins-bot: Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 (owner: 10Aude)
[14:05:05] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752974 (10Ottomata) I'm still a little confused about how this reqid/id will work? You are suggesting that it comes from the x-request-id that we w...
[14:05:56] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752975 (10Ottomata) To avoid possible conflicts, I'd suggest we call this not just `id`. How about `uuid`? That's what EventLogging capsule does:...
[14:07:09] RECOVERY - mysqld processes on db2065 is OK: PROCS OK: 1 process with command name mysqld
[14:07:19] RECOVERY - mysqld processes on db2067 is OK: PROCS OK: 1 process with command name mysqld
[14:07:22] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752979 (10Ottomata) Also, this is just a personal preference, but I'd prefer if we had a convention differentiating integer/second based 'timestamps...
[14:07:25] RECOVERY - mysqld processes on db2070 is OK: PROCS OK: 1 process with command name mysqld
[14:07:38] jynus: all of these are paging
[14:07:44] arg
[14:07:55] they are all downtimed
[14:08:12] pages that are only the recoveries?
[14:08:14] nice
[14:09:09] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable geosearch on test.wikidata (duration: 00m 17s)
[14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:09:52] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[14:10:04] <_joe_> Coren: heat problems are not real
[14:10:12] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[14:10:22] <_joe_> Coren: the log lines were in the cassandra logs in /var/log/cassandra/...
[14:10:55] _joe_: Ah, the other side of course. I tried to find where restbase was dumping its own.
[14:11:11] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[14:11:59] (03PS1) 10ArielGlenn: dumps: fix camelcases in WikiDumps.py (part 1) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248866
[14:12:01] (03PS1) 10ArielGlenn: dumps: camelcases in wikiDumps.py (part 2) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248867
[14:14:20] (03PS1) 10Aude: Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482)
[14:15:13] jynus: _joe_ is everything ok now?
[14:15:21] * aude wants to deploy
[14:15:40] if you refer to mysql, there is nothing wrong
[14:15:45] ok
[14:16:17] (03CR) 10Aude: [C: 032] Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482) (owner: 10Aude)
[14:16:23] (03Merged) 10jenkins-bot: Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482) (owner: 10Aude)
[14:16:49] so, in addition to the downtime that was already there, I have disabled notifications on those hosts
[14:18:56] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable GeoData on Wikidata (duration: 00m 17s)
[14:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:08] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753027 (10Ottomata) Also, over at [[ https://phabricator.wikimedia.org/T88459#1694274 | T88459#1694274 ]], I commented: If we adopt a convention o...
[14:35:52] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:38:21] (03PS2) 10ArielGlenn: Remove class role::dataset::publicdirs, noop [puppet] - 10https://gerrit.wikimedia.org/r/246824 (owner: 10Faidon Liambotis)
[14:39:31] jynus: we have a replication problem from m4-master to analytics-store (for eventlogging data)
[14:39:32] (03CR) 10ArielGlenn: [C: 032] Remove class role::dataset::publicdirs, noop [puppet] - 10https://gerrit.wikimedia.org/r/246824 (owner: 10Faidon Liambotis)
[14:39:44] jynus: nothing has been replicated since October 22nd for MobileWikiAppSearch_10641988
[14:39:54] (that's one example we know of, but there may be others)
[14:40:04] letting you know here, I can file a ticket if you'd like
[14:40:15] let me check
[14:43:29] I see m4 lagging behind, but not broken
[14:43:59] but only in the last 24 hours, not since the 22nd
[14:45:10] (03PS1) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871
[14:45:25] so, a 2 hour lag
[14:45:36] less than 1 hour, actually
[14:45:38] (03CR) 10jenkins-bot: [V: 04-1] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson)
[14:46:30] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1753065 (10Yurik) @akosiaris, thx, but i suspect we won't be able to do most of these steps due to perm?
[14:47:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[14:47:41] "INSERT IGNORE INTO `EchoInteraction_5782287`", done 45 minutes ago
[14:48:38] didn't nuria stop the imports for a while, milimetric?
[14:48:46] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1753067 (10akosiaris) >>! In T116553#1753065, @Yurik wrote: > @akosiaris, thx, but i suspect we won't be able to do most of these steps due to perm?...
[14:48:52] jynus: no, we stopped backfilling
[14:48:59] jynus: but not the service
[14:49:20] so, are these the backfills?
[14:49:21] (03PS1) 10BBlack: Revert "remove cp1059 from ipsec hostlists - T114870" [puppet] - 10https://gerrit.wikimedia.org/r/248874
[14:49:40] (03CR) 10BBlack: [C: 032 V: 032] Revert "remove cp1059 from ipsec hostlists - T114870" [puppet] - 10https://gerrit.wikimedia.org/r/248874 (owner: 10BBlack)
[14:49:49] jynus: there's some confusion
[14:49:55] analytics store is lagging behind m4
[14:50:03] yes
[14:50:04] m4 is not lagging as far as I can tell
[14:50:16] what is the difference?
[14:50:53] the EL consumer writes to m4, which then replicates to analytics-store
[14:50:55] m4 replication is right now 2500 seconds behind
[14:51:02] its master
[14:51:21] its master is m4-master
[14:51:23] sorry, maybe we're saying the same thing and I'm bad at the terminology
[14:51:30] it is ok
[14:51:35] there is a slave delay
[14:51:44] my point is, regardless of what the replag monitoring is telling us, executing this gives very different results:
[14:51:53] select max(timestamp) from log.MobileWikiAppSearch_10641988;
[14:51:54] my point is, maybe it is due to backlog that has been imported recently?
[14:52:09] so a temporary thing
[14:52:12] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[14:52:16] I see two spikes
[14:52:18] jynus: we only started backfilling this morning, this lag goes back to October 22nd (4 days)
[14:52:25] no
[14:52:34] that I can tell you is not true
[14:52:41] https://tendril.wikimedia.org/host/view/dbstore2002.codfw.wmnet/3306
[14:52:43] look
[14:52:47] right, i did look at that
[14:52:51] at the Replication graph
[14:53:00] so then how do we explain the absence of data past 20151022145533 on analytics-store?
[14:53:04] there was lag on the 22nd
[14:53:07] that I caused
[14:53:19] then it recovered
[14:53:20] yep, when you were helping mforns
[14:53:28] and now there is lag again
[14:53:31] due to large inserts
[14:53:35] but it seems to me at that point it stopped replicating a bunch of tables
[14:53:45] the recent lag is unrelated to the problem I'm talking about
[14:54:04] 20151022145533 == select max(timestamp) from log.MobileWikiAppSearch_10641988;
[14:54:26] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1753077 (10Yurik) 5Resolved>3Open Reopening - we should be able to recover from the replication failures. T116553#1752966 outlines steps that we s...
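Editor's note: the disagreement above comes down to two different measurements, and both can be run side by side. A minimal sketch, assuming shell access to the master and the slave; the host aliases are the ones named in the conversation, and credentials/full hostnames are placeholders:

```bash
# 1. What the replag monitoring sees: the slave's own estimate of its delay.
mysql -h analytics-store -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master

# 2. What milimetric is measuring: the newest row actually present on each side.
QUERY='SELECT MAX(timestamp) FROM log.MobileWikiAppSearch_10641988;'
mysql -h m4-master       -e "$QUERY"
mysql -h analytics-store -e "$QUERY"

# A small Seconds_Behind_Master while the MAX(timestamp) values diverge by days
# points at a table filtered or broken out of replication, not plain lag.
```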
[14:54:46] ok, let's talk on #wikimedia-databases to not flood this channel, as it is a very specific thing
[14:54:55] sorry -ops, good point, brt
[14:56:09] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752146 (10Yurik) Re-opened sudo task with extra info T106637.
[14:56:50] (03PS2) 10ArielGlenn: dataset: move system user creation to module [puppet] - 10https://gerrit.wikimedia.org/r/246825 (owner: 10Faidon Liambotis)
[14:57:16] _joe_: It's still an issue even if the MCEs are tripped by false positives, since that makes the kernel throttle the "overheating" CPUs
[14:57:51] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[14:58:10] (03CR) 10ArielGlenn: [C: 032] dataset: move system user creation to module [puppet] - 10https://gerrit.wikimedia.org/r/246825 (owner: 10Faidon Liambotis)
[14:58:55] !log repooling cp1059 varnish mobile frontend (wiped)
[14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:00:05] anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T1500). Please do the needful.
[15:00:05] ebernhardson Glaisher: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:12] here
[15:01:01] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[15:01:18] ottomata: ping re the new reqstats stuff and diamond? random cache hosts I look at now have 2x diamond processes consuming as much CPU as varnish itself....
[15:01:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:01:34] here
[15:01:40] i have pretty much all the patches, i can just ship these out
[15:02:20] Glaisher: you have a -1 from Krenair about a dependency. Has it been resolved?
[15:02:33] looks like it's proofreadpage, which looks to be out
[15:02:40] ebernhardson: yeah
[15:02:54] (03CR) 10EBernhardson: [C: 032] "proofread page has been deployed now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher)
[15:03:20] (03Merged) 10jenkins-bot: Remove Page and Index namespaces from $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher)
[15:04:20] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Remove redundant Page and Index namespaces from $wgContentNamespaces (duration: 00m 17s)
[15:04:22] Glaisher: ^
[15:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:06:42] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Update satisfaction schema id due to bad varnish caching of old id (duration: 00m 17s)
[15:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:07:00] ebernhardson: it would be a no-op and nothing seems to have broken on the wikis
[15:07:23] Glaisher: sounds right, thanks for checking
[15:08:21] Coren: what host has the heat issue / MCE logs?
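Editor's note: bblack's question can be answered on any suspect host with standard tooling. A minimal sketch; sensor names and thresholds vary by platform, and the aqs1001 readings quoted a few lines below came from the first command:

```bash
# Kernel's view of the package temperature sensors (millidegrees Celsius).
cat /sys/class/thermal/thermal_zone*/temp

# Machine-check and thermal-throttle events logged by the kernel.
dmesg | grep -i -e mce -e 'temperature above threshold'

# Out-of-band reading from the BMC, independent of the OS sensors.
sudo ipmitool sdr type Temperature
```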
[15:08:22] (03CR) 10ArielGlenn: "The reason these classes weren't inlined is that we have a history of needing to move the jobs back and forth between the primary and seco" [puppet] - 10https://gerrit.wikimedia.org/r/246826 (owner: 10Faidon Liambotis)
[15:08:34] bblack: aqs1001
[15:08:40] Coren: I used to think they weren't real either, but we ended up fixing several eqiad cache hosts and it was real
[15:09:19] bblack: I have no opinion on the actual thermal issue; that needs feet on the ground. :-)
[15:09:22] (03PS1) 10Rush: openstack: clean up old repo setups [puppet] - 10https://gerrit.wikimedia.org/r/248882
[15:10:13] well from a software perspective, the temp sensor is showing 90C, which is high for most of our hardware, and the rate of MCEs corresponds as well
[15:10:26] * Coren nods.
[15:10:31] Hence the phab task. :-)
[15:10:40] where?
[15:11:23] ah found it
[15:13:11] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1753155 (10BBlack) Note this is showing ~90-91C on the software read of the temp sensors as well: ``` root@aqs1001:~# cat /sys/class/thermal/thermal_zone*/temp 91000 90000 ``` This seems simi...
[15:13:11] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules/: Move search schema from cirrussearch -> wikimediaevents (duration: 00m 17s)
[15:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:13:46] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Move search schema from cirrussearch -> wikimediaevents (duration: 00m 19s)
[15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:14:09] something went wrong
[15:14:31] MediaWiki internal error.
[15:14:31] Exception caught inside exception handler.
[15:14:31] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.
[15:14:34] "MediaWiki internal error.
[15:14:34] Exception caught inside exception handler."
[15:14:36] yep
[15:14:45] bblack ^^
[15:15:04] Went to edit a page and got that
[15:15:14] You doing anything right now?
[15:15:18] "MediaWiki internal error.
[15:15:22] Exception caught inside exception handler.
[15:15:23] ebernhardson: ^
[15:15:24] Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information."
[15:15:25] ebernhardson: ^^^
[15:15:25] Refreshed, still happening
[15:15:26] seeing on en.wiki and mediawiki.org
[15:15:28] lol
[15:15:28] everyone ...
[15:15:31] PROBLEM - HHVM rendering on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.080 second response time
[15:15:31] PROBLEM - HHVM rendering on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.072 second response time
[15:15:31] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.116 second response time
[15:15:31] PROBLEM - HHVM rendering on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.080 second response time
[15:15:31] PROBLEM - HHVM rendering on mw2094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.151 second response time
[15:15:32] PROBLEM - HHVM rendering on mw2075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.152 second response time
[15:15:32] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.082 second response time
[15:15:32] +comma
[15:15:32] PROBLEM - HHVM rendering on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.074 second response time
[15:15:32] PROBLEM - HHVM rendering on mw2185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.146 second response time
[15:15:33] PROBLEM - HHVM rendering on mw2049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.142 second response time
[15:15:33] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.045 second response time
[15:15:45] uh oh
[15:15:45] PROBLEM - HHVM rendering on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.083 second response time
[15:15:45] PROBLEM - HHVM rendering on mw1106 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.091 second response time
[15:15:47] PROBLEM - HHVM rendering on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.075 second response time
[15:15:47] PROBLEM - HHVM rendering on mw1235 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.113 second response time
[15:15:47] PROBLEM - HHVM rendering on mw1039 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.078 second response time
[15:15:47] PROBLEM - HHVM rendering on mw2137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.144 second response time
[15:15:48] PROBLEM - HHVM rendering on mw2097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.146 second response time
[15:15:48] PROBLEM - HHVM rendering on mw2101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.141 second response time
[15:15:49] PROBLEM - HHVM rendering on mw2066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.144 second response time
[15:15:49] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.083 second response time
[15:15:52] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.062 second response time
[15:15:52] PROBLEM - HHVM rendering on mw1172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.072 second response time
[15:15:52] PROBLEM - HHVM rendering on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.067 second response time
[15:15:52] PROBLEM - HHVM rendering on mw2112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.149 second response time
[15:15:52] PROBLEM - HHVM rendering on mw2043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.141 second response time
[15:15:52] PROBLEM - HHVM rendering on mw2145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.129 second response time
[15:15:53] PROBLEM - HHVM rendering on mw2167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.147 second response time
[15:16:13] "Okay... Who Brought the dog?" ....
[15:16:16] ;)
[15:16:30] https://phabricator.wikimedia.org/T116593 <-- if someone wants to add some "the world is burning" :p
[15:16:32] omg
[15:16:43] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1753178 (10Cmjohnson) I do have thermal paste on-site. Let me know when you want to schedule downtime on each of these.
[15:17:09] tsk tsk I was first
[15:17:15] lol
[15:17:23] :-D
[15:17:55] is rollback in progress?
[15:18:02] MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php: ResourceLoader duplicate registration error. Another module has already been registered as schema.Search
[15:18:15] <_joe_> what the hell is happening?
[15:18:16] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.603 second response time
[15:18:17] ebernhardson: ?
[15:18:23] _joe_: It broke
[15:18:24] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1009 bytes in 0.077 second response time
[15:18:25] rolling back, but they were just js changes
[15:18:26] we have pretty clear timing coincidence with the deploy
[15:18:28] <_joe_> who deployed what?
[15:18:37] <_joe_> ROLL BACK FFS
[15:18:41] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1041 bytes in 0.569 second response time
[15:18:43] is it all LVS?
[15:18:47] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1012 bytes in 0.447 second response time
[15:18:53] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1051 bytes in 0.464 second response time
[15:18:59] 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753247 (10zhuyifei1999) 3NEW
[15:18:59] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.488 second response time
[15:19:01] is the thermal issue and the complete downing of Wikimedia related?)
[15:19:04] LVS is secondary
[15:19:05] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1050 bytes in 0.588 second response time
[15:19:11] thermal issue is unrelated
[15:19:15] the problem is the code deploy
[15:19:16] 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753255 (10Glaisher) Caused by https://gerrit.wikimedia.org/r/248877 rollback in progress ``` MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/...
[15:19:17] _joe_: ebernhardson is rolling back now, it started right after a sync
[15:19:24] 15:13 < logmsgbot> !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php
[15:19:25] 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753257 (10JohnLewis)
[15:19:27] ^ and related
[15:19:27] 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753258 (10JohnLewis)
[15:19:30] bblack, huh. well, the diamond stuff doesn't really work. it was working for several days fine, then certain processes started segfaulting
[15:19:33] PROBLEM - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.279 second response time
[15:19:34] not now
[15:19:37] ok
[15:19:40] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 970 bytes in 0.088 second response time
[15:19:40] 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753260 (10zhuyifei1999)
[15:19:42] 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753261 (10zhuyifei1999)
[15:19:46] PROBLEM - LVS HTTPS IPv4 on mobile-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1042 bytes in 0.272 second response time
[15:19:55] bblack, will just tell you: i'm going to submit a separate patch to disable all of those diamond collectors.
[15:20:00] ok thanks
[15:20:07] what is taking so long?
[15:20:10] <_joe_> chasemp: is it rolled back?
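Editor's note: the MWException quoted above is the whole story: mid-sync, the schema.Search ResourceLoader module ended up registered by two extensions at once, and ResourceLoader throws on duplicate registration for every request. A quick, hedged way to locate such a duplicate in a checkout; the paths follow the standard extension layout and are not verified against this exact tree:

```bash
cd /srv/mediawiki/php-1.27.0-wmf.3

# Find every extension that registers the module named in the exception.
grep -rn "schema.Search" extensions/CirrusSearch extensions/WikimediaEvents \
    --include='*.php'
# Two hits (the old registration in CirrusSearch and the moved one in
# WikimediaEvents) mean the registration throws until one side is
# synced away or rolled back.
```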
[15:20:14] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50994 bytes in 0.577 second response time [15:20:17] not yet [15:20:17] still working, it's multiple patches [15:20:21] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1051 bytes in 0.270 second response time [15:20:26] 6operations, 6Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753268 (10QuimGil) [15:20:29] PROBLEM - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8844 bytes in 0.245 second response time [15:20:35] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:55] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8839 bytes in 0.378 second response time [15:20:57] _joe_: paravoid: see it's multiple patches ^^ https://gerrit.wikimedia.org/r/#/q/owner:%22EBernhardson+%253Cebernhardson%2540wikimedia.org%253E%22,n,z [15:21:02] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 557 bytes in 0.070 second response time [15:21:08] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents: rollback (duration: 00m 18s) [15:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:39] <_joe_> chasemp: ok, the best way to do this usually could be to just roll back on tin (git reset --hard HEAD~N) [15:21:49] (03PS1) 10Ottomata: Disable all diamond varnishreqstats collectors [puppet] - 10https://gerrit.wikimedia.org/r/248888 (https://phabricator.wikimedia.org/T83580) [15:21:50] RECOVERY - HHVM rendering on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 69551 bytes in 0.101 second response time [15:21:50] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.125 second response time [15:21:50] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.125 second response time [15:21:50] RECOVERY - HHVM rendering on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.260 second response time [15:21:50] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.769 second response time [15:21:51] RECOVERY - HHVM rendering on mw2076 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.743 second response time [15:21:51] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.201 second response time [15:21:51] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.869 second response time [15:21:51] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69553 bytes in 1.307 second response time [15:21:52] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.302 second response time [15:21:53] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.737 second response time [15:21:53] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.320 second response time [15:21:53] <_joe_> and then merge the changes [15:21:54] Perhaps it's time code deployments had a sandbox/beta server first?
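_joe_'s `git reset --hard HEAD~N` tip, spelled out as a rough sequence. This is a sketch of the revert-on-tin-first flow only; the staging path and the patch count (N=2) are assumed for illustration:

```
# emergency rollback on the deploy host; gerrit bookkeeping comes later
cd /srv/mediawiki-staging/php-1.27.0-wmf.3/extensions/WikimediaEvents
git log --oneline -n 5    # identify the just-synced bad commits
git reset --hard HEAD~2   # N = number of patches to throw away (example)
cd /srv/mediawiki-staging
sync-dir php-1.27.0-wmf.3/extensions/WikimediaEvents 'rollback'
```

ori makes the same point just below: push known-good files to the fleet first, and reconcile the git/gerrit state once the site is back up.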
[15:22:00] <_joe_> heh [15:22:03] 6operations, 6Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753305 (10Glaisher) [15:22:06] RECOVERY - HHVM rendering on mw1038 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.137 second response time [15:22:06] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 69551 bytes in 0.118 second response time [15:22:07] RECOVERY - HHVM rendering on mw1071 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.143 second response time [15:22:07] RECOVERY - HHVM rendering on mw1070 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.153 second response time [15:22:07] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.151 second response time [15:22:07] RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.145 second response time [15:22:07] RECOVERY - HHVM rendering on mw1096 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.127 second response time [15:22:09] ShakespeareFan00: funnily enough they do [15:22:19] and this code has been deployed there since last week [15:22:19] ok recoveries started [15:22:20] <_joe_> JohnFLewis: no? [15:22:26] So technically deploys shouldn't cause meltdowns? [15:22:30] back to normal for me [15:22:34] <_joe_> yup recoveries [15:22:49] looks better now yes, and was def from the deploy [15:22:50] _joe_: beta cluster should be said thing [15:22:53] (03CR) 10Ottomata: [C: 032] Disable all diamond varnishreqstats collectors [puppet] - 10https://gerrit.wikimedia.org/r/248888 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:22:54] <_joe_> who's the lucky one who has to write the incident docs? [15:22:57] ebernhardson: next time don't bother with git [15:23:03] [Exception MWException] (/srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php:331) ResourceLoader duplicate registration error. Another module has already been registered as schema.Search [15:23:03] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753319 (10mobrovac) >>! In T116247#1752974, @Ottomata wrote: > I'm still a little confused about how this reqid/id will work? You are suggesting th... [15:23:05] don't bother with patches, i mean [15:23:07] fwiw [15:23:22] revert on tin, sync, then worry about git [15:23:28] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753324 (10mobrovac) [15:23:37] <_joe_> ori: my point too :) [15:23:39] paravoid: interesting, ok.
That kind of error should probably just log and not fatal the whole site [15:23:40] started at 15:13 [15:23:46] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1753326 (10BBlack) @cmjohnson above I think meant for T116584 [15:23:57] ebernhardson: actionable for the postmortem that you'll write :) [15:24:04] :) [15:24:13] not a fun way to open my laptop monday morning :) [15:24:19] paravoid: indeed [15:24:19] <_joe_> paravoid: you beat me to that joke :P [15:24:22] :D [15:24:24] ah there's the root cause, greg-g opened his laptop [15:24:30] bblack: :P [15:24:38] beware of logging issues now [15:24:40] <_joe_> yup, we don't trust neckbeards [15:25:00] that is 1 million errors/s [15:25:01] as is the fact that this wasn't caught in QA [15:25:26] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.133 second response time [15:25:26] RECOVERY - HHVM rendering on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 69555 bytes in 0.250 second response time [15:25:27] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.130 second response time [15:25:27] RECOVERY - HHVM rendering on mw1092 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.131 second response time [15:25:28] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.121 second response time [15:25:28] RECOVERY - HHVM rendering on mw1141 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.350 second response time [15:25:29] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.368 second response time [15:25:29] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10309 bytes in 0.522 second response time [15:25:30] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.269 second response time [15:25:30] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.262 second response time [15:25:31] RECOVERY - HHVM rendering on mw2097 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.297 second response time [15:25:31] RECOVERY - HHVM rendering on mw2066 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.273 second response time [15:25:31] <_joe_> jynus: right now? or just before? [15:25:41] before [15:25:50] paravoid: it looks like the issue with QA was order of patches applied [15:25:53] although there is some lag on monitoring, I suppose related to this [15:25:53] <_joe_> icinga-wm getting kicked is kind of not so good [15:26:00] paravoid: the code overall works, but it's spread across multiple repositories [15:26:18] _joe_: better than flooding the channel [15:26:20] <_joe_> jynus: yup probably related [15:26:40] ebernhardson: that's still something to learn from and protect against in the future though [15:26:42] <_joe_> paravoid: heh, if it just stays out for 1 minute, fair enough [15:26:43] finished at 15:23, that is 10 minutes of outage [15:27:36] maybe 15:20 + tail [15:28:02] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring, 3labs-sprint-118: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1753375 (10Andrew) [15:28:06] 6operations, 10ops-eqiad, 5Patch-For-Review: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1753377 (10BBlack) 5Open>3Resolved cp1059 was stable for 6 days in icinga, seems fixed.
Repooled and undowntimed (with cleared caches just in case). [15:29:54] 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1753383 (10Cmjohnson) I have thermal pate on-site. Let me know when you would like to schedule downtime to try the fix. Chris [15:30:11] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1753384 (10Andrew) [15:30:28] 6operations, 10ops-eqiad: Decommission sodium - https://phabricator.wikimedia.org/T110142#1753385 (10Cmjohnson) removed switch information [15:30:59] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Undeploy eventlogging search schema from CirrusSearch (duration: 00m 18s) [15:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:27] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753398 (10Ottomata) > I don't see a conflicting problem with id (even though id is a JSONSchema keyword, but it relates to the schema, not its prope... [15:33:58] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753399 (10Eevans) >>! In T116247#1749452, @Ottomata wrote: > Right, but how would you do this in say, Hive? Or in bash? In bash: ``` $ sudo apt-g... [15:34:43] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753400 (10Ottomata) > Manual schema versions. We could increase the schema version every time we change something in the schema. Easy to achieve but... [15:36:17] (03CR) 10Giuseppe Lavagetto: [C: 032] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh) [15:36:29] <_joe_> ori: ^^ congratulations :) [15:36:40] \o/ [15:37:37] 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1753431 (10Nuria) In order to get this requests in hadoop this domain needs to be fronted by varnish, by looking through pu... [15:37:54] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [15:38:26] <_joe_> ori: once I fix a small glitch with idleconnectionmonitor I think we can cut out a new package and start using etcd for reals here [15:38:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:42:26] (03Merged) 10jenkins-bot: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh) [15:43:41] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules: Re-deploy WME changes after deploying necessary CirrusSearch change first (duration: 00m 17s) [15:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:15] site is still up, this time, ebernhardson :) [15:44:39] greg-g: that was only the js, next one is what broke it before :P [15:44:47] oh [15:44:52] here we go... 
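For context on the sequence being logged here: the original sync fataled because both CirrusSearch and WikimediaEvents registered the schema.Search module, so the re-deploy has to land the CirrusSearch change (which drops its registration) before the WikimediaEvents one. Condensed from the !log entries around this point, with the sync messages shortened:

```
# dependency-ordered redeploy (paraphrased from the !log lines)
sync-dir  php-1.27.0-wmf.3/extensions/CirrusSearch 'undeploy search schema'
sync-dir  php-1.27.0-wmf.3/extensions/WikimediaEvents/modules 'JS only'
sync-file php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php 'PHP'
```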
[15:44:58] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Re-deploy WME changes after deploying necessary CirrusSearch change first (duration: 00m 18s) [15:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:12] * greg-g nods [15:45:34] still looks sane [15:45:53] yeah [15:51:09] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1753497 (10Selsharbaty-WMF) Hi John, Yeah, this is really helpful. I can never thank you enough! [15:54:13] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753513 (10mobrovac) >>! In T116247#1753398, @Ottomata wrote: > Ok cool, if that's the case, then `reqid` or even `request_id` (I like long names...w... [15:55:25] one last swat patch almost forgot about, that turns writes on to CODFW for cirrussearch. just going to punt that back to evening swat [15:55:57] starting an interview in a few minuts and wont be able to watch it [15:56:27] yeah, good call :) [15:57:22] it is an ops interview, could ask them how to fix it ;) [15:58:44] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [16:01:46] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753574 (10ArielGlenn) Any movement on this front? Is that spare still around? [16:02:44] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753579 (10Ottomata) > Hm, I think duplicates should be detected based on the content of the message itself and the time stamp. EventLogging explicit... [16:03:05] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:06:41] 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1753608 (10hashar) [16:10:09] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1753620 (10Dzahn) a:5akosiaris>3Dzahn [16:11:22] 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1753624 (10hashar) The Jenkins job is a template '{name}-puppetlint-strict' and indeed runs at the root of the repository. Potentia... [16:16:58] 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1753651 (10Dzahn) Why can't we use the same class in prod and labs? That's the idea of testing changes, [16:24:26] 6operations, 10ops-eqiad, 3labs-sprint-118: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1753689 (10yuvipanda) [16:24:38] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753690 (10Andrew) 3NEW a:3Andrew [16:25:01] (03CR) 10Faidon Liambotis: "I'd like to hear more about why we've needed this in the past, but in any case how is moving three lines different than moving one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/246826 (owner: 10Faidon Liambotis) [16:25:17] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753690 (10Andrew) [16:25:43] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:26:16] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753700 (10RobH) I was chatting about this with Andrew. So since all mgmt is on 'dumb' switches, we don't support multiple mgmt vlans unless... [16:37:36] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753783 (10chasemp) If the idea is these physical boxes are totally under the control of the relevant project admins we should consider mimic... [16:42:31] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1753813 (10Joe) I will start working on this in the next couple of weeks. My current plan **for tin ** is to ask people to use... [16:46:58] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1753827 (10BBlack) What are we blocked on here currently, do we need to order more SFPs or something to try plugging these in again? [16:53:52] (03PS1) 10Jforrester: Enable VisualEditor in the 'Projet' namespace on the French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248910 (https://phabricator.wikimedia.org/T116603) [16:58:31] 7Blocked-on-Operations, 6operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1753883 (10ArielGlenn) I've been looking at this and seeing a couple of behaviours, one where I... [17:00:11] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753896 (10Cmjohnson) Let's go with Rob's suggestion [17:05:22] (03PS2) 10Dzahn: releases: move base::firewall into the role [puppet] - 10https://gerrit.wikimedia.org/r/244691 (owner: 10Muehlenhoff) [17:06:21] (03CR) 10Dzahn: [C: 032] releases: move base::firewall into the role [puppet] - 10https://gerrit.wikimedia.org/r/244691 (owner: 10Muehlenhoff) [17:07:45] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1753924 (10Cmjohnson) just a thought but some of those errors could be related to sfp's.....this occured when i had put standard sfp's in and not sfp+. May be worth cabling up again and seeing if we get... [17:08:55] (03CR) 10Rush: [C: 04-1] "can you ref the issue this is for?" 
[puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren) [17:10:42] (03PS1) 10Alexandros Kosiaris: Update WikimediaEnableMultiLines to OTRS 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248915 [17:10:44] (03PS1) 10Alexandros Kosiaris: Update WikimediaTemplates to support 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 [17:12:23] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753953 (10RobH) a:5Cmjohnson>3RobH [17:12:30] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1720246 (10RobH) p:5Normal>3High [17:13:04] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1753968 (10Andrew) [17:14:13] (03PS4) 10Dzahn: Move the ferm rules for elasticsearch internode traffic into role::logstash::elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [17:14:22] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1079/" [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [17:14:34] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753981 (10RobH) I'll handle getting this spun up, and any potential onsite tasks. (since it responds to mgmt ssh, there likely won't be any other than the labeling task) I'll get this nam... [17:16:51] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1753993 (10GWicke) [17:17:31] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687618 (10GWicke) Updated the ask to two boxes per DC in the description. [17:18:09] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1754006 (10Andrew) a:5Andrew>3None [17:18:11] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1754008 (10Andrew) a:5Andrew>3None [17:19:58] (03PS1) 10Dzahn: logstash::elasticsearch add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/248918 (https://phabricator.wikimedia.org/T104964) [17:20:16] moritzm: ^ [17:21:07] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1754025 (10RobH) a:5Ottomata>3RobH [17:22:45] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1754037 (10BBlack) yeah @paravoid was saying in the meeting, basically we should try one and see how it goes. Let's cable/plug in the asw-d-eqiad connection for lvs1007 first and see what it does? Shou... [17:23:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [500.0] [17:23:46] bd808: is this right? it seems to me it is. 
"Restrict access to deployment redis to internal plus silver" https://gerrit.wikimedia.org/r/#/c/245876/ [17:26:19] ebernhardson: I talked to mark, it is 6weeks once we get the deployment done now :) (to nobelium) [17:26:53] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1754054 (10chasemp) a:5Cmjohnson>3chasemp [17:26:53] (03CR) 10BryanDavis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff) [17:27:41] (03PS3) 10Dzahn: Restrict access to deployment redis to internal plus silver [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff) [17:28:26] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1080/" [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff) [17:28:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:30:48] (03CR) 10Alexandros Kosiaris: [C: 032] "Merging after ops meeting OKed this." [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [17:30:56] (03PS2) 10Alexandros Kosiaris: admin: let kartotherian and tilerator admins read logs [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [17:31:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: let kartotherian and tilerator admins read logs [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [17:32:30] :) [17:34:46] YuviPanda: sweet! [17:35:01] YuviPanda: i never expected to take so long to get everything ready...but good to know we have time available now :) [17:35:31] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1754110 (10akosiaris) 5Open>3Resolved Change merged and tested. Resolving [17:35:43] (03PS1) 10RobH: adding in globalsign to procurement approved vendors [puppet] - 10https://gerrit.wikimedia.org/r/248921 [17:37:33] (03CR) 10RobH: [C: 032] adding in globalsign to procurement approved vendors [puppet] - 10https://gerrit.wikimedia.org/r/248921 (owner: 10RobH) [17:37:53] (03CR) 10Dzahn: [C: 032] Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [17:38:17] (03CR) 10Dzahn: "looks all good. free IPs in the original mgmt range, matches racktables" [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [17:38:26] (03CR) 10Yuvipanda: [C: 031] dynamicproxy: Empty data from initial-data.db [puppet] - 10https://gerrit.wikimedia.org/r/248622 (owner: 10Alex Monk) [17:38:26] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754123 (10GWicke) > If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLoggi... [17:40:13] wait, ehm [17:40:18] i just merged the DNS change above [17:40:29] but authdns-update diff shows me more of a change than that [17:41:02] was another merge pending from earlier? [17:41:48] i think from yesterday, yes [17:42:17] cmjohnson1: ^ ms-be1019 be1020 in eqiad? yesterday? 
[17:42:50] yeah but I merged that ...didn't I? [17:43:04] in authdns-update i see it as a change [17:43:12] but unlike puppet-merge this won't tell me "2 changes, warning" [17:43:14] oh..that would explain why they're not installing [17:43:17] it will just show me the unified diff [17:43:22] and that was confusing [17:43:40] ok, so let me merge this together [17:43:56] okay...sorry for the confusions [17:44:00] done, try it again then [17:44:01] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754144 (10greg) Jan: can you put in the description why you are requesting access? :) [17:44:13] np [17:44:27] hmm, papaul? [17:44:58] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754145 (10GWicke) > I'm not so sure actually that these will always be redundant. I think the request ID should be persisted to track the same event... [17:45:23] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures [17:45:58] (03PS2) 10MaxSem: Beta: add cache headers to WP portal [puppet] - 10https://gerrit.wikimedia.org/r/248374 [17:48:07] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1754167 (10mark) For the Labs hosts/support vlans we can just follow the eqiad model for now, and copy that to codfw, with similar IP allocations as well. [17:49:47] 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1754169 (10mark) Let's allocate a similar amount of IPs (/20 iirc or thereabouts) as we did in eqiad, but experiment with how to use them with Neutron. I... [17:50:24] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures [17:54:04] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754188 (10Dzahn) >>! In T115937#1749505, @Reedy wrote: > I guess we ideally need /home/wikipedia/conf-svn/wmf-config for the actual svn repo... I restored the entire /home/wikipedia from /home_pmtpa/wiki... [17:55:24] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures [17:56:50] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754219 (10Dzahn) @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after trying the old home_pmtpa that was mounted once on bast1001. [17:56:56] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754221 (10Dzahn) a:5Dzahn>3None [17:57:37] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754222 (10Dzahn) p:5Triage>3Normal [17:59:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 71 data above and 7 below the confidence bounds [18:00:24] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 2 failures [18:01:00] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754247 (10Dzahn) How common is this task? 
From T116553 i'm not sure this should be a part of normal admin work but rather an exception that we need mo... [18:04:45] 6operations, 7Monitoring, 5Patch-For-Review: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1754252 (10Krinkle) Initial dashboard up at . [18:05:24] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:08:55] 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1754262 (10EBernhardson) 3NEW [18:09:13] (03CR) 10Andrew Bogott: [C: 032] openstack: Remove havana/icehouse files [puppet] - 10https://gerrit.wikimedia.org/r/248619 (owner: 10Alex Monk) [18:09:18] (03PS2) 10Andrew Bogott: openstack: Remove havana/icehouse files [puppet] - 10https://gerrit.wikimedia.org/r/248619 (owner: 10Alex Monk) [18:17:33] (03PS1) 10Dzahn: test multi-role admin group behaviour [puppet] - 10https://gerrit.wikimedia.org/r/248928 [18:21:08] godog: aargg.. ^ the actual answer is not "merge" but "fail" :( [18:21:11] Error: Could not run: Conflicting value for admin::groups found in role test_bar [18:21:37] that succkkks [18:22:14] JohnFLewis: ^ fyi, too [18:23:21] I guess you need to set the admin groups in a variable (via hiera) and make admin look that up… not nice, though [18:25:12] (03CR) 10Dzahn: "fails :/ Error: Could not run: Conflicting value for admin::groups found in role test_bar" [puppet] - 10https://gerrit.wikimedia.org/r/248928 (owner: 10Dzahn) [18:25:51] (03CR) 10Dzahn: admin: add dc-ops group to role access_new_install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [18:26:19] hoo: that is via hiera :) [18:26:45] mutante: do we just look up the hiera variable? no merging? [18:26:57] * JohnFLewis looks [18:27:10] oh [18:27:13] * hoo hides [18:27:19] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754366 (10mobrovac) a:3mobrovac [PR 5](https://github.com/wikimedia/restevent/pull/5) proposes the schema definitions for the basic MW events: art... [18:28:50] hoo: JohnFLewis: yes, this is via hiera. the issues comes up once you have 2 or more roles and each role assigns admin groups and then you put 2 roles on one node [18:29:03] mutante: yeah, {{looking}} [18:29:12] so that means we can define the admin groups in hiera, but only by hostnames [18:29:16] and not by roles. which sucks [18:29:33] well. or we have to use regex.yaml [18:29:44] mutante: nooo :( [18:30:04] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:33:43] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [18:36:24] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754445 (10Yurik) @dzahn, I think it has happened twice already. This goes back to the origin of this task - we should be able to manage all aspects o... [18:37:55] (03PS3) 10MaxSem: Beta: add cache headers to WP portal [puppet] - 10https://gerrit.wikimedia.org/r/248374 [18:40:04] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
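Given that two roles which both set admin::groups now fail instead of merging, the per-host hiera workaround discussed above looks roughly like the following. The file path and group name are illustrative only (in practice you would edit the existing per-host file rather than overwrite it):

```
# per-host hieradata instead of per-role (names and path are examples)
cat <<'EOF' > hieradata/hosts/palladium.yaml
admin::groups:
  - datacenter-ops
EOF
```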
[18:41:54] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [18:44:41] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754469 (10Dzahn) I don't think rebuilding from replication failures should be considered normal and part of regular admin work. We should instead focu... [18:55:50] (03PS1) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 [19:04:47] ori: do you know of any custom code to handle replication among master and slave in eventlogging db (we are talking with jynus in #wikimedia-databases) [19:07:49] 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1754548 (10chasemp) [19:08:06] 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10chasemp) with all due respect, it has to be reviewed by security before ops can step in :) [19:08:58] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754553 (10chasemp) [19:10:18] 6operations, 10ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1754557 (10chasemp) p:5Triage>3High a:3Cmjohnson [19:11:08] 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1754562 (10chasemp) p:5Triage>3Normal [19:11:34] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:46] 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1754571 (10chasemp) p:5Triage>3Normal [19:12:50] (03PS1) 10Chad: ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 [19:13:14] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [19:13:47] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1754575 (10chasemp) I'll make a note to include this in the next ops meeting. [19:15:05] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:16:40] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754582 (10chasemp) >>! In T116487#1754144, @greg wrote: > Jan: can you put in the description why you are requesting access? :) This was the status as of the Ops meeting today. [19:17:20] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754586 (10bd808) >>! In T87036#1753813, @Joe wrote: > My current plan **for tin ** is to ask people to use mira instead of tin... [19:17:23] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:18:34] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [19:19:03] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [19:19:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754591 (10chasemp) a:3akosiaris Status as of now: not directly accepted, some extra information since the idea was the AQS requires scap3 and not ansibl... [19:19:36] 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754593 (10chasemp) [19:19:55] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) 3NEW a:3RobH [19:20:06] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754604 (10RobH) [19:20:07] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754603 (10RobH) [19:20:25] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) [19:20:26] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1754610 (10RobH) [19:20:28] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754605 (10RobH) 5Open>3Resolved WMF3542 is allocated as hostname lawrencium for this use. T116645 is for installation, resolving #hardware-request. [19:20:56] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) [19:21:33] (03PS1) 10RobH: setting lawrencium dns entries [dns] - 10https://gerrit.wikimedia.org/r/248939 [19:21:39] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754613 (10bd808) >>! In T87036#1753813, @Joe wrote: > For terbium, I still need to understand how much work - if any - will be... [19:22:08] (03CR) 10RobH: [C: 032] setting lawrencium dns entries [dns] - 10https://gerrit.wikimedia.org/r/248939 (owner: 10RobH) [19:23:51] ebernhardson: do we have any blockers other than the nobelium hardware issue? [19:23:58] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754618 (10chasemp) p:5High>3Normal [19:24:23] 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754619 (10akosiaris) >>! In T116169#1754591, @chasemp wrote: > Status as of now: not directly accepted, some extra information since the idea was the AQS requires scap3 and not ansible > > @akosiaris... [19:24:25] 6operations, 10ops-eqiad: label server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T116646#1754621 (10RobH) 3NEW a:3Cmjohnson [19:25:21] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754628 (10RobH) [19:25:54] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:27:44] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:29:24] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [19:29:33] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [19:31:08] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754644 (10demon) >>! In T87036#1754586, @bd808 wrote: >>>! In T87036#1753813, @Joe wrote: >> My current plan **for tin ** is t... [19:31:38] 6operations, 10Traffic, 7Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1754647 (10BBlack) There are a few concerns here which is why this is kind of "back burner" for now, but on the longer-term radar: * Even in the... [19:31:55] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1754650 (10chasemp) @fgiunchedi shoutout as a packaging guru :D Can you provide any guidance? [19:32:54] (03PS1) 10RobH: setting lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/248941 [19:33:28] (03CR) 10RobH: [C: 032] setting lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/248941 (owner: 10RobH) [19:33:44] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1754657 (10chasemp) Notes from meeting: Ironic will have its own model, for now the mgmt interface for any labs "hardware" node will be on i... [19:34:45] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1754663 (10chasemp) [19:36:05] !log swapped bad disk on db1030 [19:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:38] 6operations, 10ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1754677 (10Cmjohnson) replaced disk cmjohnson@db1030:~$ sudo megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Fi... [19:40:53] apergos: still there? [19:41:08] physically yes :-D [19:41:17] ori: [19:41:29] mentally pretty checked out... what's up? [19:41:35] i'll have the new package on the snapshot hosts in a few minutes [19:41:53] ah that's excellent [19:43:39] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754696 (10RobH) [19:44:06] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) [19:45:25] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754698 (10Ottomata) > If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the same primary event... 
[19:46:13] PROBLEM - Host nobelium is DOWN: PING CRITICAL - Packet loss = 100% [19:47:24] 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1754706 (10Dzahn) Ok, let's not mix up dumps. and lists. in a single ticket please. They are different and unrelated. I'm... [19:49:33] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:50:09] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754709 (10Ottomata) What do y'all think about keeping these 'framing' fields in a nested object? I'm not sure if this is a good or bad idea. If la... [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T2000). Please do the needful. [20:00:13] RECOVERY - Host nobelium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [20:00:39] will be another 15-30 mins before we are ready to deploy parsoid [20:01:59] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754777 (10RobH) Ok, it seems the H310 doesn't play nice with Debian/Jessie, so the server I allocated won't work. Re-opening the allocation task. [20:02:45] cmjohnson1: how did it go? [20:03:11] (03PS1) 10Eevans: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) [20:03:22] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754782 (10akosiaris) >>! In T115937#1754219, @Dzahn wrote: > @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after trying the old home_pmtpa that was mounted o... [20:03:31] yuvipanda...done [20:03:42] ok [20:03:50] I'll stress it and see what happens! [20:05:55] !log running stress on nobelium [20:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:53] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1754793 (10RobH) [20:06:54] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754792 (10RobH) [20:06:55] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754789 (10RobH) 5Resolved>3Open So WMF3542 has an H310 controller, which Jessie doesn't detect. Since we don't like using these controllers, I can either replace it with a 710 (overkil... [20:09:49] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754814 (10akosiaris) >>! In T115937#1754782, @akosiaris wrote: >>>! In T115937#1754219, @Dzahn wrote: >> @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after... 
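The stress run just logged amounts to something like the following; stress and lm-sensors are the stock Debian tools, and the exact invocation is an assumption:

```
# peg every core for ten minutes, watching package temperature as it runs
stress --cpu "$(nproc)" --timeout 600 &
watch -n 5 sensors   # the worry threshold discussed below is ~85C
```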
[20:10:35] RECOVERY - RAID on db1030 is OK: OK: optimal, 1 logical, 2 physical [20:11:30] 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 - https://phabricator.wikimedia.org/T116656#1754825 (10Cmjohnson) 3NEW a:3Cmjohnson [20:12:04] 6operations, 10OTRS: Apply security patch to OTRS (Scheduler Process ID File Access vulnerability) - https://phabricator.wikimedia.org/T114132#1754834 (10faidon) 5Open>3declined a:3faidon We'll upgrade OTRS to a newer major release instead, as work for this was already underway when this security vulnera... [20:12:22] no mobileapps service deploy today [20:13:43] !log deactivating ulsfo<->NTT BGP peering due to upcoming network migration [20:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:09] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754857 (10RobH) Chatted with Ariel in IRC. Going to go with one of the: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS promethium... [20:14:33] 6operations: install/setup/deploy X as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754860 (10RobH) [20:14:48] 6operations, 10ops-eqiad: label server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T116646#1754862 (10RobH) 5Open>3declined Declined, we aren't renaming this server after all. [20:14:49] 6operations: install/setup/deploy X as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) [20:16:24] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:22] YuviPanda: i can kick the import off once nobelium is working again, i don't think there are other blockers [20:17:22] (03PS3) 10Faidon Liambotis: Use testsystem role for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/247239 (owner: 10Muehlenhoff) [20:17:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Use testsystem role for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/247239 (owner: 10Muehlenhoff) [20:17:42] ebernhardson: coool. I'm running a stress test that's pegging all CPU cores [20:17:48] temperatures in 60-65 C [20:17:50] which seem ok [20:18:41] (03PS2) 10Faidon Liambotis: Mark rubidium as spare [puppet] - 10https://gerrit.wikimedia.org/r/246832 (owner: 10Muehlenhoff) [20:18:48] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Mark rubidium as spare [puppet] - 10https://gerrit.wikimedia.org/r/246832 (owner: 10Muehlenhoff) [20:18:56] YuviPanda: yea anything below 85 should be fine [20:18:59] ebernhardson: yeah [20:19:07] ebernhardson: wanna kick it off now? i can stop the test. [20:19:11] sure [20:19:32] cmjohnson1: seems to be all good! :D stress test didn't bring temperature over 65C [20:19:35] cmjohnson1: thanks! [20:19:47] !log stress test on nobelium complete, CPU temperature didn't go above 65C [20:19:49] cool! 
glad I could help [20:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:16] !log started copy of eqiad elasticsearch indices to noeblium [20:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:34] (03PS3) 10Faidon Liambotis: puppet: do not 'ensure latest' [puppet] - 10https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [20:21:34] n00belium [20:21:39] heh [20:22:10] (03CR) 10Faidon Liambotis: [C: 032] puppet: do not 'ensure latest' [puppet] - 10https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [20:22:17] ebernhardson: I see 'php' processes - are these not hhvm? [20:22:46] YuviPanda: it is hhvm, thats the default php via debian's /etc/alternatives [20:23:01] aah ok [20:23:14] PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:23:37] YuviPanda: i'm pretty sure the limit here is disk io as well, not going to manage to peg things completely [20:24:02] ebernhardson: yeah. any guesses on how long it'll take? [20:24:10] YuviPanda: a week? [20:24:19] ebernhardson: ok. [20:24:27] YuviPanda: i know thats not a great number, but its 380M documents at a few hundred per second [20:24:41] ebernhardson: yeah, 'tis still ok, esp. without the time pressure now [20:24:48] (vs 5-6k/s importing to codfw) [20:24:56] ebernhardson: what kindof tests will we do afterwards? [20:25:06] ebernhardson: oh, and that's solely because of write performance on the hardware? [20:25:27] actually it claims to have peaked at 16k docs/sec briefly, i think those were probably wiktionary's [20:25:33] YuviPanda: i think so, but not completely sur [20:25:52] YuviPanda: for labsearch i'm fairly certain, for codfw it could be a few things [20:25:53] right [20:26:36] because eventually we'll have to decom this host and figure out how much hardware we really need and then start that process [20:26:55] yea, not sure how to figure that out yet but will think of something :) [20:27:56] YuviPanda: i suppose the basic test to run is via our 'runSearch.php' script in elasticsearch. with that we can pump queries into the elasticsearch and see how it responds [20:28:21] ebernhardson: right. I don't know of ES's querying abilities, but can we do more interesting queries that we can't do in prod? [20:28:34] like, just target a particular title pattern for example [20:28:38] and before that, seeing if it can even handle the current write load or if we have to strategically disable a few popular wikis writes [20:28:50] ah, that [20:28:51] YuviPanda: can do *much* more via the ES api we are exposing in labs [20:28:52] yeah [20:29:09] YuviPanda: but the easiest way to test is piping one query per line into runSearch.php :) [20:29:28] right :) [20:29:41] just want to have a mix of expensive queries too [20:30:04] one difficulty...elasticsearch api can OOM the machine if your query is expensive enough [20:30:11] or really, the java VM [20:30:20] the machine will be fine, but the jvm will give up [20:30:27] will it restart? [20:30:34] will that cause the restart to be fairly long? 
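The runSearch.php replay ebernhardson describes above would look something like this. A sketch only, since the script's options beyond --wiki, and the query file name, aren't shown in the log:

```
# replay one query per line against the labs elasticsearch copy
mwscript extensions/CirrusSearch/maintenance/runSearch.php --wiki=enwiki \
  < queries.txt   # file name and further options are assumptions
```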
[20:31:10] it should be possible to restart, but there have been 2 prod issues in the last year where it didn't restart [20:31:25] (we use a bunch of regexes to filter this out and deny them in php, but it sometimes misses something) [20:31:41] that would be much harder/impossible for es api queries directly though [20:31:52] which is partly because we allow globbing in prod right? [20:32:07] chasemp: yes, both cases afaik were due to globbing [20:32:37] ebernhardson: we could make the proxy be more intelligent [20:33:15] use the lua stuff in nginx, or write a simple one in python / golang [20:33:18] YuviPanda: generally i'm not super worried about that, but maybe i should be? the last time this happened the person running the queries was specifically trying to run "very expensive things" to see what happens [20:33:23] are you thinking of building a quarry for es YuviPanda? [20:33:30] people kinda assume 'oh wikipedia, it's huge they can take it' [20:33:39] chasemp: nah, although now that you mention it... [20:33:39] whereas people on labs will understand the limited nature of the labs replica [20:33:51] (03PS3) 10Dzahn: interface: do not 'ensure latest',do require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) [20:34:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:34:32] (03PS4) 10Faidon Liambotis: interface: do not 'ensure latest', use require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [20:34:35] ebernhardson: possibly, but they might not know what'll take it down [20:34:46] ebernhardson: but yeah, i'm of the 'let us expose it and see what happens' camp [20:34:47] YuviPanda: true [20:35:12] chasemp: I've been meaning to rewrite quarry to decouple it from SQL so much [20:35:23] chasemp: should be able to target WDQS, Postgres (OSM), and maybe even this [20:35:23] (03CR) 10Faidon Liambotis: [C: 032] interface: do not 'ensure latest', use require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [20:35:53] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:36:16] deploying fresh parsoid code [20:36:43] YuviPanda: would be super cool :) [20:38:18] chasemp: yeah. Quarry's the first thing I'm going to move to k8s [20:39:19] chasemp: won't be touching it for a few weeks, halfak and others (including me) are doing a workshop at an ACM conference about doing collaborative research via Quarry! [20:39:45] maybe I should just write the new thing from scratch :) [20:43:58] (03PS2) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) [20:44:07] (03CR) 10Legoktm: "Also extension-list?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 (owner: 10Chad) [20:45:20] (03PS3) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) [20:46:34] RECOVERY - DPKG on osmium is OK: All packages OK [20:48:06] (03CR) 10Dzahn: [C: 032] "access request has been ACKed in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [20:50:01] !log deployed parsoid version 660c59a9 [20:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:14] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR [20:52:08] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1755054 (10Krenair) (Note that WikitechPrivateLdapSettings actually comes from puppet so I don't think we need to worry about c... [20:53:01] (03Abandoned) 10Dzahn: test multi-role admin group behaviour [puppet] - 10https://gerrit.wikimedia.org/r/248928 (owner: 10Dzahn) [20:53:14] test succesful - we know it fails [20:53:30] (03Abandoned) 10Dzahn: admin: add dc-ops group to role access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [20:59:04] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: puppet fail [21:00:55] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1755109 (10Reedy) Hmm. It should be at /srv/home_pmtpa/conf-svn but there's no sign of it. I wonder if that means we deleted it at some point. Not quite sure why we would've deleted it.. Else that means it... [21:02:35] (03PS2) 10Dzahn: admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 [21:02:44] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [21:03:38] (03PS3) 10Dzahn: admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 [21:04:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:05:58] (03CR) 10Dzahn: [C: 032] admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn) [21:06:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:07:28] (03CR) 10Dzahn: "for T115718 and has been acked in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn) [21:08:36] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1755124 (10GWicke) >>! In T116247#1754698, @Ottomata wrote: >> If we have a use case for emitting two secondary events *to the same topic* that were... 
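On the JVM-OOM worry a few lines up: besides the PHP-side regex filtering, elasticsearch ships a request circuit breaker that aborts an over-large request before it can exhaust the heap. One possible belt-and-braces setting, sketched against a stock install (the 40% figure is an example, not a tuned value):

```
# cap per-request memory so a runaway query trips a breaker, not the JVM
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "indices.breaker.request.limit": "40%" } }'
```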
[21:16:13] 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 - https://phabricator.wikimedia.org/T116656#1755150 (10Cmjohnson) racktables and labels updated...looks like dns was completed already [21:16:19] 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 - https://phabricator.wikimedia.org/T116656#1755151 (10Cmjohnson) 5Open>3Resolved [21:18:50] 6operations, 10ops-eqiad, 3labs-sprint-118: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1755165 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Looks to be good From IRC YuviPanda cmjohnson1: seems to be all good! :D stress test didn't bring temperature over 65C [21:26:04] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:26:16] jzerebecki or hoo: ping [21:27:03] 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1755188 (10Dzahn) because adding admin group via role (https://gerrit.wikimedia.org/r/#/c/246850/), doesn't work (https://gerrit.wikimedia.org/r/#/c/248928/) i ad... [21:27:09] 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1755189 (10Dzahn) 5Open>3Resolved [21:27:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn) [21:29:45] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:32:11] the dbstore1002 issue is expected, I am inserting rows there like crazy [21:37:20] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1755228 (10Cmjohnson) [21:37:22] 6operations, 10ops-eqiad: wipe einsteinium disks - https://phabricator.wikimedia.org/T116253#1755226 (10Cmjohnson) 5Open>3Resolved wiped [21:39:29] cmjohnson1: Hi, what's up? [21:40:34] hi hoo: wdqs1001 and 1002 never had mgmt correctly set up when they were re-named. This will require a few minutes of downtime...i want to know when I could do this [21:40:52] Any time is as bad as any other [21:40:52] https://phabricator.wikimedia.org/T84686 [21:41:09] The service is declared beta, so it should be ok-ish to just take them down for a short bit [21:41:28] okay..mind if I take 1001 down now? [21:41:30] Also... do you need to do both at once? If not, that should be fine [21:41:37] no..one at a time [21:41:47] That should be fine [21:41:51] SMalyshev: ^ [21:42:02] (03CR) 10MaxSem: "Tested by cherrypicking on beta puppetmaster - works as expected."
[puppet] - 10https://gerrit.wikimedia.org/r/248374 (owner: 10MaxSem) [21:42:13] I guess everything is puppetized by now, so that stuff should come back on its own [21:46:48] (03PS1) 10Cmjohnson: Adding mgmt entries for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/248999 [21:47:35] (03CR) 10Cmjohnson: [C: 032] Adding mgmt entries for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/248999 (owner: 10Cmjohnson) [21:48:15] !log powering off wdqs1001 to update idrac settings [21:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:04] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: Connection timed out [21:50:37] 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1755283 (10Papaul) ms-be2016 10.193.1.12 port xe-0/2/7 ms-be2017 10.193.1.13 port xe-0/7/7 ms-be2018 10.193.1.14 port xe-0/2/7 ms-be2019 10.193... [21:51:33] PROBLEM - Host wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:55:14] RECOVERY - Host wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [21:59:06] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: puppet fail [21:59:06] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: Connection refused [21:59:54] PROBLEM - WDQS HTTP Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [22:00:14] PROBLEM - Blazegraph Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [22:00:43] PROBLEM - Blazegraph process on wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [22:01:27] eh? ok [22:01:29] oh [22:02:28] greg-g: It's fine [22:02:30] kinda [22:03:56] production monitoring with "declared beta" [22:04:17] doesn't understand semi-prod [22:05:07] It's stable enough for production monitoring, so I don't see an issue there [22:05:16] although I'm not on these alerts [22:05:51] where's the beta part then [22:06:40] It's mostly beta as in the data representation isn't fully stable yet [22:07:00] ah, ok [22:07:09] also, *I* don't think we know much about the stability, yet... we just never had a problem [22:07:43] I wonder how much user QPS it does... that's not visible anywhere AFAIK [22:07:55] maybe the proxy in front of it logs [22:08:51] meh, seems like the services didn't come up on their own [22:09:07] cmjohnson1: ^ [22:09:14] So please leave the other one for now [22:09:54] hoo: the other was fine...so no need to bring it down. sorry for breaking things [22:12:07] 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755399 (10Dzahn) 3NEW [22:12:16] and that' [22:12:51] hoo: should i start it then? [22:13:30] mutante: I think we should do it via puppet [22:13:55] that's what the ticket is for ^ [22:14:07] but don't you want it to run now [22:15:23] mutante: Fixing it via puppet is very easy [22:15:53] 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755420 (10Dzahn) [22:17:10] ok, thought you'd want a "service start" to fix the current breakage [22:17:27] If we can do it via puppet now, I'd prefer that [22:18:07] (03PS1) 10Hoo man: Enable the wdqs service per default [puppet] - 10https://gerrit.wikimedia.org/r/249005 (https://phabricator.wikimedia.org/T116673) [22:18:21] mutante|away: ^ that should do it [22:18:26] unless I screwed up the syntax [22:19:03] ah, missed him :S [22:19:53] Anyone else around?
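For reference, hoo's patch above (r249005) amounts to having Puppet manage the service's boot and run state rather than starting it by hand. A minimal sketch of such a change; the service name and resource shape here are assumptions, not the actual patch:

    # Keep the WDQS service running and enable it at boot, so a
    # power-cycle like the wdqs1001 one above doesn't leave it down
    # until someone intervenes.
    service { 'wdqs-blazegraph':
        ensure => running,
        enable => true,   # register the init/systemd unit to start on boot
    }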
[22:20:26] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:21:04] RECOVERY - WDQS HTTP Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80 [22:21:10] wtf [22:21:24] RECOVERY - Blazegraph Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [22:21:27] (03CR) 10BBlack: [C: 032] Enable the wdqs service per default [puppet] - 10https://gerrit.wikimedia.org/r/249005 (https://phabricator.wikimedia.org/T116673) (owner: 10Hoo man) [22:21:44] RECOVERY - Blazegraph process on wdqs1001 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [22:22:36] hoo: if it came up unexpectedly, it was likely puppet running every 30 minutes [22:23:19] mh, ok [22:23:39] still having it properly enabled is better than waiting for puppet [22:23:43] yeah [22:23:56] for wdqs1001 runs at :19 and :49 every hour for puppet [22:24:09] That matches with the recoveries [22:24:27] only way to stop puppet from turning on something manually disabled is to disable puppet (as root puppet agent --disable "reason why") [22:25:14] I don't even have shell on these boxes, so I'm mostly guessing around :P [22:25:34] (03PS1) 10MaxSem: [WIP] Switch www.wikimedia.org to source control [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) [22:27:03] 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755486 (10hoo) 5Open>3Resolved a:3hoo [22:27:37] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755489 (10Yurik) 5Resolved>3Open Few issues: 1. Only works on maps2001, not {2-4} 2... [22:28:08] (03PS2) 10MaxSem: [WIP] Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) [22:30:34] PROBLEM - NTP on wdqs1001 is CRITICAL: NTP CRITICAL: Offset unknown [22:35:56] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755565 (10greg) I haven't been seeing any more of my phab+gerrit mails going into spam. Anyone else? [22:39:21] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755571 (10MoritzMuehlenhoff) >>! In T115416#1755565, @greg wrote: > I haven't been seeing any more of my phab+gerrit mails going into spam. > > Anyone else? Me neither, work... [22:44:35] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755598 (10Legoktm) >>! In T115416#1735775, @greg wrote: > The middle one (list-unsubscribe header) doesn't make sense to my pedantic brain (they aren't mailing lists); anyone... 
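The ":19 and :49" schedule mentioned above is the usual twice-hourly agent cron with a per-host offset, which is also why the services "came up on their own" within half an hour of the reboot. A rough sketch of that pattern; the exact command and flags in operations/puppet may differ:

    # fqdn_rand() is deterministic per hostname, so each host gets a
    # stable minute in [0,30) -- e.g. 19 for wdqs1001 -- and the agent
    # runs at that minute and again 30 minutes later.
    $minute = fqdn_rand(30)

    cron { 'puppet-agent':
        ensure  => present,
        user    => 'root',
        minute  => [ $minute, $minute + 30 ],  # e.g. [19, 49]
        command => '/usr/bin/puppet agent --onetime --no-daemonize',
    }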
[22:45:20] legoktm: haha ^ [22:48:59] (03PS2) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 [22:49:06] (03CR) 10jenkins-bot: [V: 04-1] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson) [22:49:31] (03PS1) 10BBlack: ssl_ciphersuite: add ECDHE+3DES options [puppet] - 10https://gerrit.wikimedia.org/r/249017 [22:53:38] 6operations, 7Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755627 (10greg) 5Open>3Resolved a:3greg Calling this good. If anyone wants to take up the issues raised by @JKrauska they should be separate tasks. [22:55:09] (03PS3) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 [22:56:03] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: Puppet has 1 failures [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T2300). [23:00:04] Krenair ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:42] Hi [23:00:51] ebernhardson, I'll do your patch first [23:01:17] (03CR) 10Alex Monk: [C: 032] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson) [23:01:24] (03Merged) 10jenkins-bot: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson) [23:01:57] Krenair: k [23:03:43] ebernhardson, I've synced it to mw1017 and testwiki should now be writing to codfw [23:03:47] please check [23:04:42] Krenair: checking [23:05:13] Krenair: everything looks sane [23:05:17] ok [23:06:10] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248871/3 (duration: 00m 18s) [23:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:20] ebernhardson, now the other sites should be doing it, please confirm all is ok [23:07:11] (03PS1) 10RobH: Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 [23:07:31] (03PS2) 10RobH: Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 [23:07:51] (03CR) 10RobH: [C: 032] Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 (owner: 10RobH) [23:10:30] Krenair: hrm, getting errors where it complains pages don't exist in the new cluster :S [23:10:39] those don't really hurt anything, lemme double check job queue [23:11:23] Krenair: yea it's safe to leave out there. That class of error we get rid of immediately.
Leaving it up will let me debug what's going on here and i can undeploy it later once i figure out what is going on [23:11:41] ok [23:12:13] (03PS2) 10Alex Monk: Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 [23:12:36] (03CR) 10Alex Monk: [C: 032] Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 (owner: 10Alex Monk) [23:12:42] (03Merged) 10jenkins-bot: Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 (owner: 10Alex Monk) [23:15:15] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/248475/ (duration: 00m 17s) [23:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:00] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248475/ (duration: 00m 17s) [23:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:19] (03PS2) 10Alex Monk: Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 [23:16:43] (03CR) 10Alex Monk: [C: 032] Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 (owner: 10Alex Monk) [23:16:53] (03Merged) 10jenkins-bot: Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 (owner: 10Alex Monk) [23:17:31] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248478/ (duration: 00m 17s) [23:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:42] yay, no corp references in wmf-config :) [23:18:24] (03PS2) 10Alex Monk: Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) [23:18:29] (03CR) 10Alex Monk: [C: 032] Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) (owner: 10Alex Monk) [23:18:36] (03Merged) 10jenkins-bot: Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) (owner: 10Alex Monk) [23:18:42] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755730 (10Dzahn) >>! In T115067#1755489, @Yurik wrote: > 1. Only works on maps2001, not... [23:19:15] !log krenair@tin Synchronized w/static/images/project-logos/vecwiki.png: https://gerrit.wikimedia.org/r/#/c/248633/ (duration: 00m 18s) [23:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:18] (purged etc.
too) [23:21:24] (03PS2) 10Alex Monk: Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 [23:21:29] (03CR) 10Alex Monk: [C: 032] Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 (owner: 10Alex Monk) [23:21:34] (03Merged) 10jenkins-bot: Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 (owner: 10Alex Monk) [23:22:45] !log krenair@tin Synchronized wmf-config/extension-list-labs: https://gerrit.wikimedia.org/r/#/c/248632/ (duration: 00m 17s) [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:03] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:45] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755777 (10Dzahn) >>! In T115067#1755730, @Dzahn wrote: > Checking this i found 2002-20... [23:28:15] (03PS3) 10Alex Monk: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor) [23:28:21] (03CR) 10Alex Monk: [C: 032] Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor) [23:28:26] (03Merged) 10jenkins-bot: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor) [23:29:14] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248371/ (duration: 00m 20s) [23:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:47] (03PS1) 10Dzahn: tilerator/k10n: add trailing * to journalctl sudo [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) [23:35:29] hoo: nominating hoo for wdqs-admin group [23:36:19] (03CR) 10Yurik: [C: 031] "Thanks for getting the patch quickly!" [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [23:36:54] (03CR) 10Dzahn: [C: 032] "just follow-up fix to ACKed request. this was the intention." [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [23:37:02] Stas doesn't want more admin, so probably not [23:41:18] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755845 (10Dzahn) >>! In T115067#1755489, @Yurik wrote: > 2. Turns out this approach doe... [23:41:54] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Puppet has 1 failures [23:44:35] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) 
and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755853 (10Dzahn) 5Open>3Resolved [23:47:36] (03PS1) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) [23:49:15] (03PS2) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) [23:49:34] (03PS3) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) [23:49:44] (03Abandoned) 10Dzahn: admin: create group aqs-restbase-deployers [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) (owner: 10Dzahn) [23:49:58] (03CR) 10Yuvipanda: [C: 032 V: 032] ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) (owner: 10Yuvipanda) [23:54:24] 10Ops-Access-Requests, 6operations: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755904 (10Dzahn) 3NEW [23:55:27] Krenair, could you let me know when you're done SWAT? We have a patch (wasn't planned for SWAT, it's an unbreak now we just discovered). [23:55:44] matt_flaschen, yep, sorry, I'm done [23:55:51] No problem, thanks. [23:55:51] want me to sync something? [23:55:57] or are you going to? or..? [23:56:07] Krenair, sure if you don't mind: https://gerrit.wikimedia.org/r/#/c/249026/ [23:56:56] (03PS1) 10Dzahn: admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) [23:56:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755922 (10Dzahn) [23:56:58] doing [23:57:02] Thanks [23:57:46] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755932 (10Dzahn) p:5Triage>3Normal [23:58:00] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755934 (10Yurik) Works, awesome, thanks!
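On the journalctl sudo fix above (r249023): sudoers matches the permitted command line literally, so without a trailing * only the bare `journalctl -u tilerator` invocation is allowed and anything with extra arguments is denied. A minimal sketch in the sudo::group style used in operations/puppet; the group name, resource title, and binary path are assumptions:

    # Let the service's admin group read its own journald logs.
    # The trailing '*' is the point of the fix: it makes variants like
    # "journalctl -u tilerator -n 100" match the sudoers entry too.
    sudo::group { 'tilerator-journalctl':
        group      => 'tilerator-admin',  # hypothetical group name
        privileges => [
            'ALL = NOPASSWD: /bin/journalctl -u tilerator*',
        ],
    }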