[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151223T0000). Please do the needful.
[00:00:10] operations, RevisionScoringAsAService, ores, Monitoring: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899724 (Dzahn) p:Triage>Normal
[00:01:15] Reedy: legoktm: anomie: Is https://gerrit.wikimedia.org/r/#/c/257950/2 something that can feasibly be covered by a unit test?
[00:05:13] I'm not really sure
[00:06:29] (PS1) Dzahn: mha: move roles to module/role/ [puppet] - https://gerrit.wikimedia.org/r/260694
[00:06:31] (PS1) Dzahn: ores: enhance ORES monitoring pt.2 [puppet] - https://gerrit.wikimedia.org/r/260695 (https://phabricator.wikimedia.org/T121656)
[00:08:36] (PS2) Dzahn: mha: move roles to module/role/ [puppet] - https://gerrit.wikimedia.org/r/260694
[00:09:26] (PS2) Dzahn: ores: enhance ORES monitoring pt.2 [puppet] - https://gerrit.wikimedia.org/r/260695 (https://phabricator.wikimedia.org/T121656)
[00:10:48] (PS3) Dzahn: ores: enhance ORES monitoring pt.2 [puppet] - https://gerrit.wikimedia.org/r/260695 (https://phabricator.wikimedia.org/T121656)
[00:11:47] (CR) Dzahn: [C: 2] ores: enhance ORES monitoring pt.2 [puppet] - https://gerrit.wikimedia.org/r/260695 (https://phabricator.wikimedia.org/T121656) (owner: Dzahn)
[00:12:29] (PS1) Jdlrobson: Fix invalid config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/260696
[00:14:06] (CR) Alex Monk: [C: 2] Fix invalid config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/260696 (owner: Jdlrobson)
[00:14:35] (Merged) jenkins-bot: Fix invalid config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/260696 (owner: Jdlrobson)
[00:16:20] (PS3) Dzahn: mha: move roles to module/role/ [puppet] - https://gerrit.wikimedia.org/r/260694
[00:19:12] (PS1) Dzahn: package::builder: move role to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260697
[00:22:45] (PS1) Dzahn: ganglia: move roles to modeules/role/ [puppet] - https://gerrit.wikimedia.org/r/260698
[00:23:03] (PS2) Dzahn: ganglia: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260698
[00:24:44] operations, RevisionScoringAsAService, ores, Monitoring, Patch-For-Review: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899832 (Dzahn) https://gerrit.wikimedia.org/r/#/c/260692/2 https://gerrit.wikimedia.org/r/#/c/260695/ https://icinga.wikimedia.org/cgi-bin/icing...
[00:25:06] (PS1) Jdlrobson: ... and drop invalid alpha platform from survey definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/260699
[00:26:03] (CR) Alex Monk: [C: 2] ... and drop invalid alpha platform from survey definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/260699 (owner: Jdlrobson)
[00:26:25] operations, Deployment-Systems, Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1899836 (Dzahn) a:Dzahn>None
[00:26:30] So technically there's nothing up for swat
[00:26:34] (Merged) jenkins-bot: ... and drop invalid alpha platform from survey definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/260699 (owner: Jdlrobson)
[00:26:36] But we're fixing QuickSurveys stuff in beta now
[00:26:43] because it completely took out beta
[00:28:31] I'm sure greg would approve
[00:28:37] although we're only touching -labs files anyway, so..
[00:31:28] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:31:51] !log Ran UPDATE flow_workflow SET workflow_page_id = 41854369 WHERE workflow_wiki = 'enwiki' AND workflow_namespace = 5 AND workflow_title_text = 'Flow/Developer_test_page' AND workflow_page_id = 48099373; to work around DB inconsistency (T117812)
[00:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:27] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:32:28] PROBLEM - SSH on mw1133 is CRITICAL: Server answer
[00:33:28] PROBLEM - dhclient process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:26] PROBLEM - HHVM processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:25] PROBLEM - Check size of conntrack table on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:44] PROBLEM - DPKG on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:55] PROBLEM - salt-minion processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:55] PROBLEM - Disk space on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:36:14] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:36:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[00:36:50] !log manually fixed up stuck global rename of "RCJU-ArCJ" -> "Archives cantonales jurassiennes"
[00:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:36:56] !lgo mw1133 - powercycle
[00:37:02] !log mw1133 - powercycle
[00:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:54] PROBLEM - nutcracker process on mw1133 is CRITICAL: Timeout while attempting connection
[00:38:44] PROBLEM - RAID on mw1133 is CRITICAL: Timeout while attempting connection
[00:39:04] PROBLEM - configured eth on mw1133 is CRITICAL: Connection refused by host
[00:39:14] RECOVERY - DPKG on mw1133 is OK: All packages OK
[00:39:15] RECOVERY - HHVM processes on mw1133 is OK: PROCS OK: 1 process with command name hhvm
[00:39:32] scap is still waiting for that host :/
[00:39:34] RECOVERY - salt-minion processes on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:39:34] RECOVERY - Disk space on mw1133 is OK: DISK OK
[00:39:35] RECOVERY - nutcracker process on mw1133 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:39:45] RECOVERY - dhclient process on mw1133 is OK: PROCS OK: 0 processes with command name dhclient
[00:40:26] RECOVERY - RAID on mw1133 is OK: OK: no RAID installed
[00:40:35] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.502 second response time
[00:40:44] RECOVERY - SSH on mw1133 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:40:45] RECOVERY - Check size of conntrack table on mw1133 is OK: OK: nf_conntrack is 4 % full
[00:40:53] !log krenair@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/260696 & https://gerrit.wikimedia.org/r/260699 (duration: 05m 28s)
[00:40:55] RECOVERY - configured eth on mw1133 is OK: OK - interfaces up
[00:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:41:56] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 65913 bytes in 2.074 second response time
[00:42:07] (PS1) Dzahn: ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701
[00:43:25] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:43:41] (CR) jenkins-bot: [V: -1] ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701 (owner: Dzahn)
[00:43:55] (PS2) Dzahn: ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701
[00:45:33] (CR) jenkins-bot: [V: -1] ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701 (owner: Dzahn)
[00:45:37] (PS3) Dzahn: ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701
[00:45:40] groar
[00:47:51] (CR) Dzahn: [C: 2] ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701 (owner: Dzahn)
[00:48:48] really, gerrit? it switches to the "submit" button, but clicking it does nothing
[00:49:31] grmbl...dependencies
[00:50:15] (PS4) Dzahn: ores: adjust string to check on home page [puppet] - https://gerrit.wikimedia.org/r/260701
[01:01:37] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail
[01:06:08] mutante: ah, is that why ores just paged?
[01:08:58] Somehow, my ORES alerts are coming to the wrong mailbox.
[01:09:11] How difficult is it to switch that?
[01:11:10] * halfak forwards via gmail
[01:12:56] YuviPanda: yea :/
[01:13:14] halfak: which one is the right one then
[01:13:26] it contacts "team-ores"
[01:14:11] sent via PM
[01:27:27] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:29:51] (PS1) Dzahn: ores: do not check for a string on / [puppet] - https://gerrit.wikimedia.org/r/260705
[01:30:25] (CR) Dzahn: [C: 2] ores: do not check for a string on / [puppet] - https://gerrit.wikimedia.org/r/260705 (owner: Dzahn)
[01:30:44] operations, RevisionScoringAsAService, ores, Monitoring, Patch-For-Review: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899912 (Halfak) I have confirmed that paging now makes it to my phone.
[01:31:20] * halfak gets pages on his phone
[01:31:22] Success!
[01:31:36] Now to deal with the waking-partner-up-at-2am problem :)
[01:31:56] halfak: we can change your timezone
[01:32:01] Suddenly, Jenny has a stake in ORES system stability.
[01:32:04] it doesn't have to be 24x7
[01:32:19] mutante, good to know, but let's leave it as-is for now.
[01:33:11] so, the summary: the new monitoring that actually checks URLs like http://ores.wmflabs.org/scores/testwiki/reverted// has been added and works
[01:33:28] the paging was because i touched the old monitoring, which i should have just left alone
[01:33:50] and that is going to be just like before again
[01:34:21] mutante, how often does icinga run checks?
[01:34:24] both are green again now, sorry
[01:34:27] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader
[01:39:24] halfak: it's configurable with normal_check_interval and it's set to 1 (minute) for this one
[01:39:58] mutante, that should be good. Thanks :)
[01:40:10] ^ 1 minute checks
[01:41:06] operations, RevisionScoringAsAService, ores, Monitoring, Patch-For-Review: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899917 (Dzahn) Open>Resolved https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader has been implement...
[01:41:27] operations, RevisionScoringAsAService, ores, Monitoring: Add monitoring to ORES workers - https://phabricator.wikimedia.org/T121656#1899919 (Dzahn)
[01:42:42] halfak: you're welcome. i'll call it resolved and leave it untouched now :)
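[Editor's note: normal_check_interval is a standard Icinga 1.x service directive; with the default interval_length of 60 seconds its value is in minutes. A minimal sketch of a service definition shaped like the check discussed above — the host, command, and description are illustrative placeholders, not the actual WMF configuration:

    define service {
        host_name              ores.wmflabs.org      ; illustrative host
        service_description    ORES worker check     ; illustrative name
        check_command          check_http            ; placeholder command
        normal_check_interval  1                     ; minutes between checks while OK
        retry_check_interval   1                     ; minutes between re-checks after a failure
        max_check_attempts     3                     ; soft failures before a hard state and notification
        contact_groups         team-ores             ; the contact group mentioned above
    }
]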
[01:42:53] YuviPanda too
[01:43:06] :)
[01:43:50] :D
[01:43:55] thanks for handling it, mutante
[01:48:34] (PS1) Dzahn: add several parked domains [dns] - https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914)
[02:23:24] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 18s)
[02:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Dec 23 02:30:25 UTC 2015 (duration 7m 1s)
[02:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:59] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: puppet fail
[03:21:30] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:39:40] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[03:59:41] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73346 MB (3% inode=99%)
[04:04:39] (PS2) Yuvipanda: dynamicproxy: Add error graphite counts too [puppet] - https://gerrit.wikimedia.org/r/260625
[04:05:47] (CR) Yuvipanda: [C: 2 V: 2] dynamicproxy: Add error graphite counts too [puppet] - https://gerrit.wikimedia.org/r/260625 (owner: Yuvipanda)
[04:06:21] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:30:45] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:49] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:09] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:19] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] PROBLEM - cassandra-a service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[06:33:10] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:28] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:39] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:19] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:19] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused
[06:35:52] operations, Analytics, Discovery, EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1900332 (JanZerebecki)
[06:36:58] operations, Analytics-Kanban, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1900334 (JanZerebecki) Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer?
[06:39:10] operations, Analytics-Kanban, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1900343 (yuvipanda) >>! In T114443#1900334, @JanZerebecki wrote: > Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer? I suspect not,...
[06:50:59] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:19] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:20] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:39] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:09] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:57:39] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:58:19] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:29] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:23] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: puppet fail
[07:17:13] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:29:13] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:07:36] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:30:44] operations, Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1900375 (faidon) >>! In T122069#1896872, @ori wrote: > I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause. > > * on mw1015: excluded cir...
[08:35:38] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:57] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:58:07] RECOVERY - cassandra-a service on restbase1008 is OK: OK - cassandra-a is active
[09:06:44] PROBLEM - cassandra-a service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[09:08:24] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:18:39] !log nodetool removenode e2813bb9-f1f2-4d21-ac19-95a7a35b4513 in preparation for adding 1004 to the cluster without bootstrap
[09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:54] PROBLEM - cassandra-a service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[09:24:23] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[09:31:04] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:34:31] RECOVERY - cassandra-a service on restbase1004 is OK: OK - cassandra-a is active
[09:36:20] RECOVERY - cassandra-a CQL 10.64.32.192:9042 on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042
[09:41:50] RECOVERY - Disk space on restbase1008 is OK: DISK OK
[09:43:01] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[09:43:22] RECOVERY - cassandra-a service on restbase1008 is OK: OK - cassandra-a is active
[09:49:22] (CR) Alexandros Kosiaris: [C: 2 V: 2] "LGTM thanks!" [puppet] - https://gerrit.wikimedia.org/r/260697 (owner: Dzahn)
[09:49:31] (PS2) Alexandros Kosiaris: package::builder: move role to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/260697 (owner: Dzahn)
[09:51:56] !log wiped & started bootstrap on restbase1008
[09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:52:04] !log rebuilding restbase1004
[09:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:41] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:58:00] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:01:10] RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.000 second response time on port 9042
[10:09:32] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:11:16] !log restarting and reconfiguring mysql at db2035
[10:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:20:00] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1900493 (yuvipanda) NEW
[10:20:33] operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1900500 (yuvipanda) Test instance with this package installed (but still older kernel) is congratulatory-green-hair.eqiad.wmflabs.
[10:23:32] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[10:29:15] paravoid: *waves*. When you're done with tcpdumping on tools-worker-07, would it be possible to strace an ssh login? That might help to figure out why ssh logins hang
[10:29:28] because NFS hangs
[10:29:51] sure, but sshd is not supposed to hit NFS for root logins
[10:30:26] we already found one stupid thing that made it hit NFS, I hope there isn't more :)
[10:31:44] YuviPanda: do you mean the hints file, or did you find something else recently?
[10:32:13] no, the hints file
[10:35:08] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 833
[10:36:06] it gets stuck at...
[10:36:25] stat("/home/ssh-key-ldap-lookup/.local/lib/python2.7/site-packages")
[10:36:29] wtf :)
[10:36:38] definitely a bug
[10:36:41] wtf
[10:36:43] lol
[10:37:19] so "ssh-key-ldap-lookup root" gets stuck in the D state
[10:38:56] Yeah, it should probably just have something on /var as home directory?
[10:39:34] or /run, or just /. Not sure.
[10:39:39] can /dev/null be a homedir?
[10:39:46] /nonexistent is the standard one for system users
[10:39:48] ah
[10:39:50] okay
[10:39:53] well, for nobody
[10:40:03] *nod*
[10:40:06] lemme write up a patch
[10:40:08] system users can have their own under /var, depending on their purpose
[10:40:15] (PS1) Yuvipanda: ldap: Set home for the LDAP lookup user [puppet] - https://gerrit.wikimedia.org/r/260734
[10:40:16] valhallasw`cloud: too late
[10:40:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 999
[10:40:27] valhallasw`cloud: ^^
[10:40:29] YuviPanda: also set shell?
[10:40:33] to /bin/false
[10:41:10] hm, or maybe it needs it to be /bin/sh for sshd to call the script? not sure
[10:41:14] (PS2) Yuvipanda: ldap: Set home for the LDAP lookup user [puppet] - https://gerrit.wikimedia.org/r/260734
[10:41:23] valhallasw`cloud: I don't think it needs to be? why would that matter?
[10:41:52] (CR) Merlijn van Deen: [C: 1] ldap: Set home for the LDAP lookup user [puppet] - https://gerrit.wikimedia.org/r/260734 (owner: Yuvipanda)
[10:42:01] paravoid: I'm thinking of making half the nodes get to 3.19 and half remain on 4.2. that sound ok?
[10:43:40] sgtm
[10:43:48] (PS3) Yuvipanda: ldap: Set home for the LDAP lookup user [puppet] - https://gerrit.wikimedia.org/r/260734 (https://phabricator.wikimedia.org/T104327)
[10:43:53] paravoid: ok
[10:45:00] (PS1) Alexandros Kosiaris: Introduce bohrium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/260735 (https://phabricator.wikimedia.org/T116312)
[10:45:21] so I set ~ to /nonexistent and shell to /bin/false
[10:45:25] manually
[10:45:28] on -07
[10:45:31] root logins work
[10:45:48] (CR) Faidon Liambotis: [C: 2] ldap: Set home for the LDAP lookup user [puppet] - https://gerrit.wikimedia.org/r/260734 (https://phabricator.wikimedia.org/T104327) (owner: Yuvipanda)
[10:46:00] paravoid: awesome! \o/
[10:46:59] (after invalidating the !@%$!%#!@!@ nscd cache of course)
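[Editor's note: the fix discussed here (gerrit change 260734) amounts to giving the lookup user a home directory that can never resolve to an NFS path. A minimal Puppet sketch of that shape — the resource layout is illustrative, not the literal patch:

    user { 'ssh-key-ldap-lookup':
        ensure => present,
        system => true,            # system account, not an interactive user
        home   => '/nonexistent',  # standard no-op home; never touches NFS
        shell  => '/bin/false',    # no login shell needed; sshd invokes the lookup script directly
    }
]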
[10:47:13] paravoid: you should also document the vnc stuff somewhere...
[10:47:24] what about it?
[10:47:35] was it on wikitech at all?
[10:47:50] it's just ssh -L 5900:localhost:590X labvirtNNNN and then vnc on localhost
[10:48:08] and you can find the port with virsh or netstat
[10:48:44] (PS2) Alexandros Kosiaris: Introduce bohrium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/260735 (https://phabricator.wikimedia.org/T116312)
[10:49:06] operations, Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1900524 (ori) >>! In T122069#1900375, @faidon wrote: >>>! In T122069#1896872, @ori wrote: >> I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the...
[10:50:18] RECOVERY - check_mysql on db1008 is OK: Uptime: 152002 Threads: 179 Questions: 8506300 Slow queries: 2107 Opens: 6392 Flush tables: 2 Open tables: 405 Queries per second avg: 55.961 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:54:41] did gerrit just die?
[10:55:00] seems ok to me
[10:55:01] wfm
[10:55:16] ok, just me (typo)
[10:55:20] phew :)
[10:55:28] * YuviPanda turns on autocorrect on aude's keyboards
[10:55:32] lol
[10:55:48] !log rebooting and reconfiguring mysql on db2041
[10:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:56:00] hmm
[10:56:06] GRUB_DEFAULT=2 doesn't seem to work
[10:56:08] but fuck, it's 3AM
[10:56:11] i should go to sleep
[10:56:20] paravoid: <3 thanks for the debug :D
[10:56:27] valhallasw`cloud: thanks for remembering about ssh :D
[10:56:32] sleep well yuvi :)
[10:56:54] I shall indeed!
[10:56:54] stupid phabricator shows 2 notifications next to the alarm bell
[10:57:04] but if I click on it, it says 'no unread notifications'
[10:59:21] (CR) Alexandros Kosiaris: [C: 2] Introduce bohrium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/260735 (https://phabricator.wikimedia.org/T116312) (owner: Alexandros Kosiaris)
[10:59:40] YuviPanda: good night
[10:59:56] paravoid: hah, yes, it always does that for me as well. There's a 'clear all messages' button, though
[11:11:55] !log roll-upgrade cassandra to 2.1.12 on aqs100[123]
[11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:15:29] PROBLEM - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused
[11:15:49] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused
[11:16:39] PROBLEM - restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[11:16:50] PROBLEM - restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[11:16:59] PROBLEM - restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[11:17:33] that's me ^
[11:18:56] ok godog, I was wondering :)
[11:19:16] Thanks for upgrading to the new version :)
[11:19:45] (CR) Alexandros Kosiaris: [C: -1] "Needs a" [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: Ottomata)
[11:19:52] np joal, it is taking much longer than I expected to start back up tho
[11:20:14] godog: Loooooooot of data in those babies
[11:20:34] So reloading and ensuring consistency takes a long time
[11:21:08] ACKNOWLEDGEMENT - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused Filippo Giunchedi cassandra upgrade/restart in progress
[11:21:10] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused Filippo Giunchedi cassandra upgrade/restart in progress
[11:21:10] ACKNOWLEDGEMENT - restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) Filippo Giunchedi cassandra upgrade/restart in progress
[11:21:11] ACKNOWLEDGEMENT - restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) Filippo Giunchedi cassandra upgrade/restart in progress
[11:21:11] ACKNOWLEDGEMENT - restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) Filippo Giunchedi cassandra upgrade/restart in progress
[11:21:26] indeed
[11:21:56] godog: you have restarted them one after the other or all together ?
[11:23:58] joal: just aqs1001 now
[11:24:09] ok godog
[11:24:41] !log reloading and reconfiguring mysql on db2049
[11:24:45] I wonder why tests on aqs1002 and 1003 fail though :(
[11:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:27:05] joal: likely because restbase can't talk to all three
[11:27:19] Ahh, makes sense godog
[11:29:49] operations, Incident-Labs-NFS-20151216: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1900607 (faidon) We also seem to get kernel backtraces every day at 01:00 (seems like LVM snapshot related, probably some periodic job?): ``` Dec 22 01:00:...
[11:32:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:34:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:36:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:39:19] RECOVERY - restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:39:50] RECOVERY - restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[11:40:00] RECOVERY - restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[11:40:09] RECOVERY - cassandra CQL 10.64.0.123:9042 on aqs1001 is OK: TCP OK - 0.010 second response time on port 9042
[11:40:22] godog: makes me feel good when the warnings stop :)
[11:40:29] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.000 second response time on port 9042
[11:40:41] joal: heheh I'm good to go with aqs1002, that'd be likely another ~20min before it starts up though
[11:40:54] np :)
[11:41:06] At least it seems nothing gets broken in the process, which is good :)
[11:42:11] joal: indeed, how frequently are you putting data there btw? is there a daily job or sth like that?
[11:43:03] godog: multiple jobs, one hourly (small), three daily (1 big, 1 medium, 1 small), 1 monthly (small)
[11:44:10] ack, thanks!
[11:44:17] np :)
[11:44:19] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:50:19] PROBLEM - cassandra CQL 10.64.32.175:9042 on aqs1002 is CRITICAL: Connection refused
[11:50:41] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection refused
[11:56:47] (PS1) Reedy: Set wgLocaltimezone for orwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/260745 (https://phabricator.wikimedia.org/T122273)
[12:01:42] (CR) Alexandros Kosiaris: "Oh, we just need to deploy a version that falls back to reading from the repo instead of /etc/cxserver/config.yaml the configuration for r" [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[12:07:08] (CR) Nikerabbit: CX: Use config.yaml to read registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[12:09:20] (CR) KartikMistry: CX: Use config.yaml to read registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[12:10:20] (PS1) Alexandros Kosiaris: bohrium: Set up DHCP/PXE and site.pp parameters [puppet] - https://gerrit.wikimedia.org/r/260749 (https://phabricator.wikimedia.org/T116312)
[12:10:31] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.015 second response time on port 9042
[12:11:51] RECOVERY - cassandra CQL 10.64.32.175:9042 on aqs1002 is OK: TCP OK - 0.003 second response time on port 9042
[12:13:40] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - kartotherian_6533 - Could not depool server maps-test2002.codfw.wmnet because of too many down!
[12:14:00] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - kartotherian_6533 - Could not depool server maps-test2002.codfw.wmnet because of too many down!
[12:14:13] !log upgrade cassandra on aqs1003
[12:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:16:38] (CR) Alexandros Kosiaris: [C: 2] bohrium: Set up DHCP/PXE and site.pp parameters [puppet] - https://gerrit.wikimedia.org/r/260749 (https://phabricator.wikimedia.org/T116312) (owner: Alexandros Kosiaris)
[12:16:52] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection refused
[12:17:11] PROBLEM - cassandra CQL 10.64.48.117:9042 on aqs1003 is CRITICAL: Connection refused
[12:21:27] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1900694 (jcrespo)
[12:23:52] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1900696 (jcrespo) #DBA / #blocked-on-schema-change have been created and documented and in production: https://wikitech.wikimedia.org/wiki/Schema_changes https://wikitech.wikimedia....
[12:29:21] godog: wth is going on today???
[12:30:02] mobrovac: what do you mean?
[12:30:27] godog: cass on both rb and aqs alerts
[12:30:32] no disk space on rb1008 etc
[12:30:35] !log restart and reconfigure mysql at db2056
[12:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:30:45] our infra doesn't seem to be in a holiday mood
[12:32:04] mobrovac: hehe aqs is the upgrade, rb is known
[12:34:42] yeah, known but still ...
[12:40:31] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:41:01] RECOVERY - cassandra CQL 10.64.48.117:9042 on aqs1003 is OK: TCP OK - 0.001 second response time on port 9042
[12:41:02] !log reenabling event scheduler on db1046 (eventlogging m4-master)
[12:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:42:03] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.002 second response time on port 9042
[12:46:02] operations, RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1900714 (fgiunchedi) upgrade completed on aqs100[123], also note that due to how much data these nodes have ATM on spinning disks it takes ~20min for each node to start back up, we've seen restbase throw...
[12:46:18] operations, RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1900715 (fgiunchedi) upgrade completed on aqs100[123], also note that due to how much data these nodes have ATM on spinning disks it takes ~20min for each node to start back up, we've seen restbase throw...
[13:01:25] (PS1) Filippo Giunchedi: admin: add filippo@ yubikey ssh key [puppet] - https://gerrit.wikimedia.org/r/260753
[13:02:25] (CR) Filippo Giunchedi: [C: 2 V: 2] admin: add filippo@ yubikey ssh key [puppet] - https://gerrit.wikimedia.org/r/260753 (owner: Filippo Giunchedi)
[13:07:03] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[13:14:53] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: puppet fail
[13:16:53] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail
[13:26:46] !log restart and reconfigure mysql at db2063
[13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:39:56] RECOVERY - puppet last run on ganeti2005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[13:45:06] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:02:11] akosiaris, MaxSem any thoughts? https://phabricator.wikimedia.org/T122270
[14:05:11] !log restart and reconfigure mysql at db2064
[14:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:11:24] akosiaris, did you shut down postgres on maps2002?
[14:12:07] operations, Discovery, Maps: Tilerator Error: permission denied for relation planet_osm_polygon - https://phabricator.wikimedia.org/T122270#1900861 (Yurik)
[14:17:18] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900868 (Krenair)
[14:18:02] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900753 (Krenair) sounds more like traffic/network ops than software architecture
[14:25:22] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:25:49] akosiaris, MaxSem something major broke on maps
[14:27:16] jynus, need help, maps just fell apart, no idea why
[14:27:24] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 966194 bytes in 9.028 second response time
[14:28:36] yurik: permissions on the tables. As expected due to osm2pgsql recreating the tables. I am handling it
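[Editor's note: the "permission denied for relation planet_osm_polygon" error in T122270 is the usual symptom of osm2pgsql dropping and recreating tables, which discards any privileges previously granted on them. The fix akosiaris describes amounts to re-granting read access; a hedged PostgreSQL sketch, where the role name is an assumption rather than the actual service account:

    -- Re-grant read access after osm2pgsql has recreated the rendering tables.
    -- 'kartotherian' here stands in for whatever role the tile services connect as.
    GRANT SELECT ON planet_osm_polygon TO kartotherian;
    -- osm2pgsql recreates several planet_osm_* tables, so granting on the whole schema also works:
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO kartotherian;
]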
[14:29:17] akosiaris, i don't think it's related - cassandra should still be working, yet maps.wikimedia.org is all crazy
[14:32:14] hmmm that's bad
[14:32:38] akosiaris, that is bad. I'm connecting to 6533 on all maps, and 2001 works fine
[14:33:07] yes, 2001 being the only one working fine is to be expected
[14:33:18] I've stopped kartotherian and tilerator on the other ones
[14:33:23] as I am fixing the permissions
[14:33:28] akosiaris, why did you stop kartotherian?
[14:33:35] it doesn't use psql
[14:33:46] just being on the safe side
[14:33:57] plus, as long as 2001 is still working, LVS uses that one
[14:34:03] it gets data from cassandra
[14:34:04] and we can survive the load
[14:34:19] either lvs is broken, or something else is
[14:34:20] no idea
[14:34:20] yeah sure, but stopping it should not have hurt like that
[14:34:24] true
[14:34:36] so, yesterday I also upgraded cassandra
[14:34:47] but nodetool status said everything was fine
[14:34:48] bleh, too many upgrades at once
[14:34:49] not good
[14:35:06] tbh, I never realized why we did the cassandra upgrade
[14:35:14] in the first place at all
[14:35:32] but again, 2.1.8 => 2.1.12
[14:35:41] unless 2.1.12 is very very seriously broken
[14:35:45] that should not have happened
[14:37:01] akosiaris, maps.wikimedia.org gives me 503
[14:37:19] i think varnish or lvs is broken
[14:37:31] i'm starting back the kartotherian
[14:37:33] it should be safe
[14:37:49] FWIW we haven't seen cassandra issues with restbase load with 2.1.12
[14:38:14] i don't think this is cassandra - i'm seeing maps just fine when I ssh directly to maps2001
[14:38:25] yurik: wait, don't start kartotherian yet
[14:38:26] which picks up data from cassandra and renders it without varnish
[14:38:31] ok
[14:38:37] let's debug this a bit
[14:38:54] so, kartotherian works fine on maps-test2001
[14:38:57] that's good
[14:39:06] yes
[14:39:41] to test: ssh -L 6533:localhost:6533 maps-test2001.codfw.wmnet - browse http://localhost:6533
[14:40:02] i think it's either varnish or lvs
[14:40:10] bblack ?
[14:41:12] yurik: I don't think he is around
[14:41:18] as in xmas not around
[14:41:51] i'm seeing either 503, or sometimes it loads but fails to load leaflet.css -- which is what causes the map to be crazy
[14:42:31] that's good then
[14:42:41] if leaflet.css is the problem I think we can fix that
[14:43:39] leaflet request shows cp1044 miss(0), cp1043 frontend miss(0)
[14:43:51] and it's 503
[14:43:52] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900911 (Aklapper) @Trizek-WMF: Have Orange already tried contacting answers@wikimedia.org ? Reminds me of https://phabricator.wikimedia.org/T110208 ...
[14:44:24] which is weird - leaflet is the most requested resource from varnishes
[14:45:09] also, if i try to load it directly, it works fine
[14:45:16] from the same varnish servers
[14:45:25] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[14:46:10] has anything been changed in varnish config recently?
[14:46:26] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[14:46:39] since yesterday ? no
[14:46:57] and even more than that
[14:47:08] i'm not exactly sure when this started though
[14:47:08] ah ok
[14:47:11] so refresh
[14:47:17] yeee!
[14:47:21] what was the problem?
[14:47:31] me having stopped kartotherian
[14:47:39] seems after all 1 server could not handle the increased load
[14:47:52] of having the databases resynced and serving traffic
[14:47:54] it was also processing tilerator
[14:48:00] oh and that too
[14:48:08] 4 threads
[14:48:21] it's strange that it was specifically leaflet.css
[14:48:35] so leaflet.css failing was the problem.. ok
[14:48:44] I was afraid we had messed up the tiles somehow
[14:48:59] yeah, so was i
[14:49:36] but this did uncover some interesting problems -- it seems js and css are not being cached by varnish
[14:49:48] i might not be setting the right headers
[14:49:59] could be
[14:50:05] i have support for this, but might need to change configs
[14:51:22] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900914 (Krenair) answers@ is listed as an address for legal at https://meta.wikimedia.org/wiki/Legal#Legal_Team_Email_Contacts so it's probably not very useful in this case That said, they als...
[14:51:22] akosiaris, where did you see that it was the bottleneck? CPU-wise it hasn't changed much
[14:51:34] https://ganglia.wikimedia.org/latest/?c=Maps%20Cluster%20codfw&h=maps-test2001.codfw.wmnet&
[14:51:42] look at 14:00 UTC
[14:51:54] which is I presume the time you started tilerator ?
[14:52:28] yes
[14:52:47] ok, so that was the actual reason this started failing
[14:52:54] the service was fine up to that point
[14:52:59] even with the 1 backend
[14:53:22] but after that, for some reason kartotherian stopped serving leaflet.css
[14:53:31] at least that's my current theory
[14:53:41] I suppose kartotherian's logs should help more
[14:53:49] i'm not sure i follow - this is a separate cpu+sql process. Kartotherian should work as before
[14:54:10] especially in such a predictable fashion of failing one specific resource
[14:54:32] we don't log every request - would kill the logs
[14:56:00] oh it is, not arguing with that
[14:56:10] a separate process I mean
[14:56:35] and i used to test it with 100% cpu before - was working fine
[14:56:39] but somehow the CPU + IO wait contention made kartotherian on maps-test2001 fail to serve leaflet.css
[14:56:56] i used to run not 4 but 8 tilerators at once
[14:57:00] and it was still working ok
[14:57:05] (for kartotherian)
[14:57:11] yeah cause the rest of the backends were fine
[14:57:23] they all ran 100% cpu
[14:57:35] one tilerator per core on all 4 machines
[14:57:47] sometimes i would even have a higher number than cores
[14:57:54] on the other hand, it's true kartotherian worked fine in local lookups
[14:58:09] I am wondering if it was just stalling and some timeout kicked in
[14:58:49] that's possible - something by varnish? yet the weird thing is that it would *always* fail if requested as a resource, and *always* succeed if requested directly by the browser
[14:58:51] ah hmm unless it was not CPU contention
[14:59:08] ah got it
[14:59:13] it was network contention https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=maps-test2001.codfw.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Maps+Cluster+codfw
[14:59:22] ok that makes more sense now
[14:59:32] that's probably it
[14:59:55] ah, yeah, that's more like it -- so who was using all the bandwidth?
[15:00:05] cassandra or psql sync?
[15:00:07] s/was/is/
[15:00:11] psql sync
[15:00:29] makes sense
[15:00:58] QoS traffic shaping?
[15:01:17] e.g. reduce the psql priority
[15:01:53] or just depool the master when doing it
[15:02:34] by stopping kartotherian?
[15:03:47] that or setting pybal config. both will pretty much do the same thing
[15:03:54] however the latter is way more polite
[15:04:03] as it will allow in-flight requests to be served
[15:04:04] where is it, and do i have access ?
[15:04:11] pybal ? no you don't
[15:04:18] then it's a moot point ))
[15:04:24] actually it's not
[15:04:40] there's an ongoing effort to integrate it with etcd
[15:04:45] etcd?
[15:04:50] which is going rather well
[15:05:10] let's just say it's a REST datastore that will allow you to do a REST request to pool/depool a server
[15:05:17] the idea is to use it soon in scap3
[15:05:25] so you will get that for free
[15:05:43] not sure on the ETA though
[15:05:55] good to know, thx. I wonder why logs are all broken
[15:06:07] kartotherian hasn't been logging anything since forever
[15:06:12] and neither is tilerator
[15:06:13] but the idea is to depool a server, update, repool
[15:08:40] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900936 (Aklapper) >>! In T122293#1900914, @Krenair wrote: > answers@ is listed as an address for legal at https://meta.wikimedia.org/wiki/Legal#Legal_Team_Email_Contacts so it's probably not v...
[15:16:20] akosiaris, do you know what headers varnish honors as far as caching policy goes?
[15:19:43] operations, Analytics-Kanban, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1900948 (Ottomata) No, consumption is not part of the MVP. There may be future work to make consumption from Kafka via websockets easy to set up, but we will not make any Events...
[15:25:48] yurik: https://www.varnish-software.com/book/3/HTTP.html look at cache-related headers
[15:26:24] varnish is very very strict in its compliance with HTTP
[15:28:23] (PS3) Ottomata: Add LVS/PyBal config for eventbus [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780)
[15:30:30] akosiaris, i think bblack told me once that it actually ignores some of them, like expires
[15:30:39] (CR) Alexandros Kosiaris: [C: 1] "Looks ok to me. When do you want to merge?" [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: Ottomata)
[15:30:44] it might be for older varnish that we use
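[Editor's note: by default Varnish derives an object's TTL from the standard HTTP caching headers, preferring s-maxage over max-age in Cache-Control; whether Expires is honored can additionally depend on the local VCL, which may be what the remark about it "ignoring expires" refers to. An illustrative response header a tile server could emit so static assets such as leaflet.css become cacheable (values are examples only):

    Cache-Control: public, s-maxage=86400, max-age=3600
]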
[15:30:48] akosiaris: let's do it!
[15:31:37] ottomata: err, I got to run in like 5-10 mins
[15:31:48] ottomata: if you are ready on your end
[15:31:51] all ready
[15:31:57] I can merge it by myself tomorrow morning
[15:32:00] oh ok
[15:32:03] european morning
[15:32:06] ;-)
[15:32:10] will it potentially hurt things to do it now if you aren't around?
[15:32:15] if it doesn't work it's ok
[15:32:27] nothing is using it yet
[15:32:52] ah.. I just realized it's missing things
[15:32:56] good that you asked
[15:33:05] oh? k what's it need?
[15:33:07] I was thinking what it might hurt and
[15:33:23] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900968 (Trizek-WMF) @Krenair: thanks for triage :) >>! In T122293#1900936, @Aklapper wrote: > As far as I understand it, Legal is just **one** topic potentially "covered" by answers@. answer...
[15:34:38] ottomata: modules/role/manifests/lvs/balancer.pp
[15:34:48] add it to the corresponding lvs pair as well
[15:35:04] low-traffic are 1003, 1006
[15:35:41] with aqs k
[15:36:58] hmm, akosiaris did I somehow lose the service_ips in common.yaml??
[15:37:06] sorry
[15:37:08] configuration.yaml
[15:37:18] hm
[15:37:24] I did!
[15:37:51] ah, yes you did ...
[15:38:16] coming in...
[15:39:25] (PS4) Ottomata: Add LVS/PyBal config for eventbus [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780)
[15:40:25] a /v1/topics.. even better
[15:41:01] (CR) Alexandros Kosiaris: [C: 1] Add LVS/PyBal config for eventbus [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: Ottomata)
[15:41:34] cool sooOoOo merging? :)
[15:41:42] so ottomata, if you do want to merge, a) don't restart pybal on lvs1003 and lvs1006 too close to each other (give it like 5 mins). b) make sure we don't get a page if it is not working
[15:42:02] as in schedule a downtime in icinga really quickly
[15:42:20] I can't think of anything else
[15:42:34] it'll page page?
[15:42:37] like sms page?
[15:43:01] yes
[15:43:11] hm, can we tell it not to do that in puppet?
[15:43:20] and enable that later?
[15:43:27] not really at this point
[15:43:31] hm
[15:43:45] we would have to add some extra stuff into that yaml config
[15:43:47] hah, oook, so after merge first step is to get the alert on neon and schedule downtime
[15:43:53] could I just schedule downtime for the host and all services now?
[15:44:19] not sure if a new service would be covered by scheduling downtime now... guess it should
[15:44:20] no, cause the eventbus.svc.eqiad.wmnet host does not exist yet
[15:44:21] can't hurt
[15:44:33] aah, did you merge the DNS change ?
[15:44:35] host, hm, oh because that is on the lvs monitoring
[15:44:37] yes
[15:44:57] cool
[15:45:12] ok, let's do it, i'll get the alert in icinga and quiet it as quickly as I can just in case
[15:45:16] ok
[15:45:23] hm, lemme double check and make sure I can curl from lvs1003
[15:45:43] ja looks good
[15:45:44] ok let's do it
[15:45:59] (CR) Ottomata: [C: 2] Add LVS/PyBal config for eventbus [puppet] - https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: Ottomata)
[15:47:04] akosiaris: so, this requires a manual pybal restart on lvs100[36]?
[15:47:12] exactly
[15:47:16] but don't do them together
[15:47:18] yeah
[15:47:37] am running puppet on kafka100[12] and those lvss now, then will run on neon and disable alert
[15:49:08] akosiaris: is this the PyBal backends health check?
[15:53:19] ottomata: no it will be an eventbus.svc.eqiad.wmnet host
[15:53:25] with one check PENDING
[15:53:43] what host will that be on in icinga?
[15:53:51] and, does that not show up until I restart pybal?
[15:54:03] i didn't see that added by puppet on neon
[15:54:20] it will show up as soon as puppet runs on neon and icinga is reloaded
[15:54:32] hm, i did that
[15:54:35] nice race condition, heh ? ;-)
[15:54:41] ha, yeah
[15:54:42] heh
[15:54:46] running puppet again i guess?
[15:54:55] akosiaris: what node is that on, graphite100?
[15:55:05] heh ?
[15:55:07] 1001*
[15:55:16] what's graphite got to do with anything ? I lost you there
[15:55:29] sometimes monitoring checks that aren't specific to a certain host are there
[15:55:42] oh the threshold checks
[15:55:46] when you say eventbus.svc.eqiad.wmnet host
[15:55:49] yeah, nothing to do with that
[15:55:52] you mean one of the realservers?
[15:55:54] no
[15:55:58] it's the LVS IP
[15:56:01] it's "virtual"
[15:56:03] OH, it's a fake host!
[15:56:03] ohhh
[15:56:11] cool
[15:56:15] ok
[15:56:25] ok, i'm running puppet on neon again
[15:56:38] i've already run puppet on realservers and lvss, and i see the config changes applied
[15:56:46] i haven't restarted pybal yet
[15:57:01] ok, so it's something like that https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=mathoid.svc.eqiad.wmnet
[15:57:17] cool, got it
[15:57:18] makes sense now
[15:57:19] akosiaris, seems like it's done
[15:57:24] i'm starting tilerator
[15:57:36] the network has dropped
[15:57:40] (usage)
[15:57:43] yurik: yes it's done
[15:57:46] about 10 mins ago
[15:57:51] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:57:56] akosiaris: do I also need to run puppet on 1009 and 1012?
[15:57:56] +# LVS balancers: lvs1003 lvs1006 lvs1009 lvs1012
[15:58:09] hmmm
[15:58:25] ahh there's the new check!
[15:59:08] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=eventbus.svc.eqiad.wmnet
[15:59:13] ah great
[15:59:14] scheduling downtime
[15:59:17] ok
[15:59:32] akosiaris: i ran puppet on 1009 and 1012, and the configs were added
[15:59:36] so maybe I need to restart pybal there too?
[15:59:37] ok
[15:59:49] yeah, those are new, I haven't had to mess with them yet ...
[15:59:55] but yeah do
[16:00:00] ok, downtime scheduled, am restarting pybal on 1003
[16:00:54] seeing warns for WARN: restbase1004.eqiad.wmnet(disabled/down/not pooled):
[16:01:06] in pybal.log, but maybe that was happening before too?
[16:01:46] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[16:02:06] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:02:15] yeah ottomata that host is depooled atm
[16:02:23] ok
[16:02:43] ok, giving this a few mins, then will restart pybal on 1006...
[16:06:17] akosiaris: do I need to add pybal files that will show up in config-master... somewhere? ( i forget where those are these days)
[16:06:26] (just restarted pybal on 1006)
[16:07:25] /srv/pybal-config/pybal/eqiad on palladium?
[16:07:29] should I add an eventbus file there?
[16:08:14] godog: ^ ? do you know?
[16:08:29] yes
[16:08:35] ottomata: yes do that
[16:08:40] filename: eventbus
[16:08:42] k
[16:08:44] sorry forgot about that
[16:09:13] commit too?
[16:10:05] did.
[16:10:13] http://config-master.wikimedia.org/pybal/eqiad/eventbus
[16:11:07] (restarted pybal on 1009)
[16:17:25] (restarted pybal on 1012)
[16:19:39] looking good!
[16:20:09] wow, indeed
[16:20:11] nice
[16:20:37] I've never managed to have a new LVS service up and running without some form of hiccup up to now
[16:22:28] Yeah, the debian bootstrap hasn't liked LVS for a while; which surprises me no end given how ubiquitously it is used.
[16:28:01] ha awesome!
[16:28:03] thanks for the help!
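[Editor's note: the per-service pool files under /srv/pybal-config/pybal/eqiad (published at config-master.wikimedia.org) list one backend per line as a Python-style dict that PyBal evaluates. A sketch of what the new eventbus file plausibly contained — the hosts are inferred from the kafka100[12] puppet runs above, and the weight is an assumed value:

    {'host': 'kafka1001.eqiad.wmnet', 'weight': 10, 'enabled': True}
    {'host': 'kafka1002.eqiad.wmnet', 'weight': 10, 'enabled': True}
]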
[16:28:07] i'm removing the downtime
[16:29:05] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[16:37:36] running home, back shortly
[16:38:54] operations, Wikimedia-Site-Requests, Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901064 (demon) @coren asked me to chime in since holidays and freezing and such. This is fine. {{approved}}
[16:40:43] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:44:40] operations, Wikimedia-Site-Requests, Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901078 (Krenair) Isn't 2016-01-11 going to be after the freeze? Or are you actually approving it to be done during the freeze?
[17:00:05] akosiaris RobH: Dear anthropoid, the time has come. Please deploy Puppet SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151223T1700).
[17:00:30] eh? that should be tomorrow i thought... no patches anyhow
[17:00:49] akosiaris: did you shift it to wednesday? or did I make a mistake when I added my name? (either is fine, no patches ;)
[17:01:04] (wednesday makes sense since tomorrow is christmas eve, just curious)
[17:05:47] (CR) Dzahn: [C: 1] "ticket says to wait until today and there are no objections, looks like it can be merged" [puppet] - https://gerrit.wikimedia.org/r/260442 (owner: RobH)
[17:06:34] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:06:37] (PS3) RobH: setting up shell access for Casey Dentinger [puppet] - https://gerrit.wikimedia.org/r/260442
[17:06:40] robh: maybe the access request then ?) ah
[17:06:46] it's on my list
[17:06:49] but i'll do it now ;]
[17:07:01] cool, just happened to see that date
[17:07:08] i kind of was trying to knock down the procurement stuff first ;]
[17:07:35] (CR) RobH: [C: 2] setting up shell access for Casey Dentinger [puppet] - https://gerrit.wikimedia.org/r/260442 (owner: RobH)
[17:07:38] come on zuuuuuul!
[17:07:51] mutante: at least today we aren't in ff battle with everyone =]
[17:08:02] there it is
[17:08:10] heh, yea, keeping hands off
[17:08:22] nah i appreciate the +1 cuz it shows i'm not just merging shit!
[17:08:40] and i end up merging a lot of crap in installs without reviews so having them when possible is best
[17:08:45] yea, but not the rebase button
[17:08:53] it's merged
[17:09:02] great
[17:09:33] heh, i'm ordering and merging things today, it's my gift to everyone....
[17:09:35] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: puppet fail
[17:09:43] that i'm paid to do on a daily basis.....
[17:10:05] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Puppet has 1 failures
[17:10:11] icinga-wm: i don't believe you about those :/
[17:10:50] only if it happens to half a dozen or more =P
[17:10:54] hrmm, looking anyways
[17:11:26] !log cloning s2 databases from dbstore2001 to dbstore2002 (s2 replication disabled on both)
[17:11:27] (CR) Andrew Bogott: [C: 1] Remove now obsolete OpenDJ server module and related templates/files [puppet] - https://gerrit.wikimedia.org/r/260542 (owner: Muehlenhoff)
[17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:13:09] Ops-Access-Requests, operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1901130 (RobH) stalled>Resolved No objections, so this is now live. I've also forced a puppet run on both bast1001.wikimedia.org and stat1002.eqiad.wmnet, so they have your keys. T...
[17:13:43] (PS2) Andrew Bogott: openstack: rename openstack-manager class [puppet] - https://gerrit.wikimedia.org/r/260192 (owner: Dzahn)
[17:14:03] operations, Traffic: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1901138 (Dzahn) This actually sounds more like it needs a #netops.
[17:14:05] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:14:32] operations, Traffic, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1901139 (Dzahn)
[17:14:56] (CR) Andrew Bogott: [C: 2] openstack: rename openstack-manager class [puppet] - https://gerrit.wikimedia.org/r/260192 (owner: Dzahn)
[17:15:06] operations, Traffic, netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1900753 (Dzahn) @Faidon It sounds to me like Orange wants to talk to a network engineer about routing issues with Tele2.
[17:15:34] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:16:30] thank you Andrew
[17:18:02] mutante, yeah, wasn't sure to what extent traffic covered networking
[17:18:06] thanks for clearing it up
[17:19:24] Krenair: it feels like they want to talk about routing and Tele2
[17:20:46] (PS3) Andrew Bogott: openstack: rename queue-server to queue_server [puppet] - https://gerrit.wikimedia.org/r/260188 (owner: Dzahn)
[17:21:27] probably
[17:23:20] (CR) Andrew Bogott: [C: 2] openstack: rename queue-server to queue_server [puppet] - https://gerrit.wikimedia.org/r/260188 (owner: Dzahn)
[17:27:06] (CR) coren: "Shouldn't be the case - not having access.conf lets any valid user in; but not having PAM use LDAP means the only valid users are local on" [puppet] - https://gerrit.wikimedia.org/r/257411 (https://phabricator.wikimedia.org/T120710) (owner: coren)
[17:28:12] (CR) Andrew Bogott: "Blindly applying Alex's suggestions in hopes of getting this train back on the tracks..." [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [17:28:22] (03PS1) 10Merlijn van Deen: dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 [17:28:41] (03PS2) 10Merlijn van Deen: dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 (https://phabricator.wikimedia.org/T122307) [17:28:54] (03CR) 10coren: [C: 04-1] "But also, this patch needs to be updated to remove wikimedia-labs-pam too as this no longer makes sense given security::access and securit" [puppet] - 10https://gerrit.wikimedia.org/r/257411 (https://phabricator.wikimedia.org/T120710) (owner: 10coren) [17:29:52] (03PS9) 10Andrew Bogott: Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [17:32:28] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901208 (10coren) @Krenair: That's what I asked @demon to okay; we want to be able to do press releases in January so the more time to set the wiki up the better. :-) [17:32:45] (03PS3) 10Merlijn van Deen: dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 [17:34:04] andrewbogott: ^ could you take a look? [17:34:13] yep! [17:34:33] I'm planning to stop puppet on proxy-01, deploy on -02, test, then afterwards restart puppet on -01 [17:35:21] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901231 (10Krenair) Okay, well what day do you think we should do this on then? [17:35:22] banned_ips now comes from hiera, which is even better! (I hope ;-)) [17:35:57] /.error/labs-logo.png <- are you making different logos for the error page? [17:36:40] lemme check where those come from [17:37:35] (03PS1) 10Dzahn: add an url-downloader service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/260770 (https://phabricator.wikimedia.org/T122134) [17:38:04] andrewbogott: I think it's just png with the right size: https://github.com/wikimedia/operations-puppet/blob/production/modules/dynamicproxy/files/labs-logo-2x.png [17:38:21] it could have just (ab)used commons for it, I guess [17:38:29] (03PS2) 10Dzahn: add an url-downloader service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/260770 (https://phabricator.wikimedia.org/T122134) [17:38:42] (03PS4) 10Merlijn van Deen: dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 [17:38:48] ^ this fixes the filenames for tool-labs, which were still / rather than /.error [17:42:53] Do you mean the logo files were broken before? [17:43:20] no, but the error pages used a different system [17:43:24] I've now moved everything in /.error [17:43:37] oh, ok — so the files were moved by hand, not puppetized? [17:43:38] rather than taking over / [17:43:48] they should be puppetized [17:44:14] https://github.com/wikimedia/operations-puppet/blob/production/modules/dynamicproxy/manifests/init.pp#L85 [17:44:24] I think I'm misunderstanding what you mean [17:45:39] Krenair: Heh. It's just too late for the ongoing deployment swap (ends in 15m). How do you feel about next Monday? Or are you already off on vacation? [17:45:48] I’m sure my question just doesn’t make sense. 
I’ll look again [17:46:25] s/swap/swat/ [17:46:30] the thing i’m confused about now is why it’s “.error” in some contexts and “error” in others [17:46:37] (03PS3) 10Dzahn: add an url-downloader service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/260770 (https://phabricator.wikimedia.org/T122134) [17:49:20] (03PS4) 10Dzahn: add an url-downloader service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/260770 (https://phabricator.wikimedia.org/T122134) [17:50:39] oh, nm, I see why [17:50:49] (03CR) 10Andrew Bogott: [C: 031] dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 (owner: 10Merlijn van Deen) [17:51:32] (03PS5) 10Andrew Bogott: dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 (owner: 10Merlijn van Deen) [17:53:37] 6operations, 5Patch-For-Review: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1901332 (10Dzahn) ^ Here's a change to add one on acamar in codfw, using the same code that we use for the existing one in eqiad. How would we go about the IP and DNS entries though? We... [17:54:20] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: make banning users easier [puppet] - 10https://gerrit.wikimedia.org/r/260768 (owner: 10Merlijn van Deen) [17:54:59] Krenair: December 31st midnight UTC :p [17:58:21] 6operations, 10Traffic, 10netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1901341 (10Trizek-WMF) >>! In T122293#1901139, @Dzahn wrote: > @Faidon It sounds to me like Orange wants to talk to a network engineer about routing issues with Tele2. "Routing issu... [17:58:58] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1901345 (10GWicke) 5Open>3Resolved a:3GWicke > note that due to how much data these nodes have ATM on spinning disks it takes ~20min for each node to start back up, we've seen restbase throwing 500s... [17:59:38] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901350 (10coren) We just missed a window, and tomorrow is a "Friday", I think. How does Dec 28 sound to you? [18:00:02] 6operations, 10Traffic, 10netops: Orange S.A. searches a contact at WMF for tests - https://phabricator.wikimedia.org/T122293#1901356 (10Dzahn) 5Open>3stalled @Trizek-WMF thank you for the update. setting to stalled. [18:03:45] (03PS1) 10Merlijn van Deen: dynamicproxy: remove spurious <% end %> [puppet] - 10https://gerrit.wikimedia.org/r/260773 [18:03:55] andrewbogott: ^ of course I messed up ;-) [18:04:07] ok, one second... [18:04:30] there's no hurry [18:04:30] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: remove spurious <% end %> [puppet] - 10https://gerrit.wikimedia.org/r/260773 (owner: 10Merlijn van Deen) [18:04:37] \o/ thanks [18:04:53] waiting for jenkins still though [18:12:17] 6operations: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#1901369 (10MoritzMuehlenhoff) Interesting, I'll check what went wrong after my vacation. In the mean time using the meta package linux-meta will fix this as well; it handles the update of the initram... 
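The dynamicproxy change merged above moves the ban list into hiera, so blocking an abusive client becomes a data edit rather than a template edit. A minimal sketch of the idea; the class layout and hiera key name here are assumptions based on the discussion, not the merged patch:

    # modules/dynamicproxy/manifests/init.pp (sketch only)
    class dynamicproxy (
        # Puppet 3's automatic parameter lookup lets operators set
        #   dynamicproxy::banned_ips: ['192.0.2.10', '198.51.100.7']
        # in hiera to override this default without a code change.
        $banned_ips = [],
    ) {
        file { '/etc/nginx/sites-enabled/proxy':   # illustrative path
            content => template('dynamicproxy/proxy.conf.erb'),
        }
    }

The template then only needs to iterate over @banned_ips; see the ERB sketch below.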
[18:17:11] (03PS1) 10Merlijn van Deen: dynamicproxy: use correct syntax for a for loop [puppet] - 10https://gerrit.wikimedia.org/r/260775 [18:17:25] !log bohrium - finish install, signing puppet certs [18:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:29] (03PS2) 10Andrew Bogott: dynamicproxy: use correct syntax for a for loop [puppet] - 10https://gerrit.wikimedia.org/r/260775 (owner: 10Merlijn van Deen) [18:20:52] (03PS2) 10Madhuvishy: [WIP] wikimetrics: Puppet module for wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/260687 [18:20:58] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: use correct syntax for a for loop [puppet] - 10https://gerrit.wikimedia.org/r/260775 (owner: 10Merlijn van Deen) [18:22:05] PROBLEM - DPKG on mw2058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:24:04] RECOVERY - DPKG on mw2058 is OK: All packages OK [18:24:36] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901384 (10Dzahn) Alex installed the OS, i finished it by adding it to puppet. bohrium.eqiad.wmnet is up and running and has a... [18:25:47] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901394 (10Nuria) @Dzahn: Can we reuse analytics-admins? [18:26:44] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: Puppet has 20 failures [18:27:41] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901403 (10JMinor) Can we also add @BGerstle-WMF to that group? I don't see us accessing the box directly (we'll use the web in... [18:34:06] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1901421 (10ArielGlenn) 5Open>3Resolved new packages in repo and installed on all production hosts except neodymium (running with local patches st...
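The two template fixes above ("remove spurious <% end %>" and "use correct syntax for a for loop") are the usual ERB pitfalls: every block opened with do needs exactly one matching <% end %>, and a Ruby each-loop is the idiomatic iteration. A hedged sketch of what the banned-IPs section of the nginx template presumably looks like after both fixes; file name and directives are illustrative:

    # modules/dynamicproxy/templates/proxy.conf.erb (sketch only)
    <%- @banned_ips.each do |ip| -%>
    deny <%= ip %>;
    <%- end -%>

A second, stray <% end %> after that block, or an unclosed for-loop, is presumably exactly what broke the earlier patchsets and prompted the two follow-ups.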
[18:34:07] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1901423 (10ArielGlenn) [18:36:42] 6operations, 7HTTPS: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1901440 (10RobH) 3NEW a:3RobH [18:37:38] 6operations, 7HTTPS: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1901448 (10RobH) 3NEW a:3RobH [18:38:10] 6operations, 7HTTPS: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1901448 (10RobH) [18:38:12] 6operations, 7HTTPS: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1901440 (10RobH) [18:38:14] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1901459 (10RobH) [18:38:15] !log restart and reconfigure mysql at db2037 [18:38:19] 6operations, 7HTTPS: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1901448 (10RobH) [18:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:29] 6operations, 7HTTPS: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1901440 (10RobH) [18:38:38] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1901432 (10RobH) [18:41:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1901476 (10ArielGlenn) After upgrading all but about 5 production hosts to the new packages with our patches, I now see this interesting but annoying behavior. When I was about halfway throu... [18:48:51] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1901490 (10Dzahn) The analytics-admins group is just 2 people. We briefly talked on IRC and agreed we should create a new admin... [18:52:52] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:15:54] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail [19:18:09] (03CR) 10Aaron Schulz: [C: 032] Remove unused $wgMaxSquidPurgeTitles setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260506 (owner: 10Aaron Schulz) [19:18:57] (03Merged) 10jenkins-bot: Remove unused $wgMaxSquidPurgeTitles setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260506 (owner: 10Aaron Schulz) [19:20:06] Coren, I suppose Monday is fine [19:20:34] Urgent deploy is urgent [19:20:58] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1901570 (10Krenair) Fine with me [19:21:03] Hardly urgent; just very useful yet low risk. 
:-) [19:21:30] I meant removing unused global [19:22:08] 10Ops-Access-Requests, 6operations: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901573 (10Dzahn) 3NEW a:3JMinor [19:22:16] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Remove unused $wgMaxSquidPurgeTitles setting (duration: 00m 30s) [19:22:16] 10Ops-Access-Requests, 6operations: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901581 (10Dzahn) a:5JMinor>3None [19:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:40] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901584 (10Dzahn) Here's the access request ticket to handle the root access: T122325 [19:29:56] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901605 (10Dzahn) So the server is installed and running and @ottomata has access and the rest of the access is going to be han... [19:30:17] 6operations, 6Parsing-Team, 10Parsoid: Update ruthenium to 14.04 from 12.04 - https://phabricator.wikimedia.org/T122328#1901606 (10ssastry) 3NEW [19:30:55] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901613 (10Dzahn) 5Open>3Resolved please feel free to reopen it if you think something is missing (besides the access for m... [19:31:08] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 2 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901617 (10Dzahn) [19:37:00] (03PS1) 10RobH: new dumps.wikimeida.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260783 [19:37:12] (03PS3) 10Madhuvishy: [WIP] wikimetrics: Puppet module for wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/260687 [19:37:54] (03PS2) 10RobH: new dumps.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260783 [19:38:29] 6operations, 7HTTPS: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1901625 (10RobH) [19:38:57] 6operations, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1901627 (10RobH) [19:40:51] (03PS1) 10RobH: new tendril.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260784 [19:41:07] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1901630 (10RobH) [19:41:09] !log logstash1002 - started logstash service [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:04] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:42:24] !log ran puppet on mw2112 [19:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:43] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [19:44:27] 6operations,
10Wikipedia-iOS-App-Product-Backlog, 6Zero, 10vm-requests, 5iOS-app-v5-production: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1901646 (10Nuria) [19:44:45] 6operations, 6Analytics-Kanban, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 2 others: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1746196 (10Nuria) [19:44:45] (03PS1) 10RobH: new ticket.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260785 [19:45:03] 6operations, 6Analytics-Engineering: kafka broker monitoring broken - https://phabricator.wikimedia.org/T122330#1901648 (10Dzahn) 3NEW [19:45:12] 6operations, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1901655 (10RobH) [19:46:44] most of the graphite based monitoring is broken [19:46:51] "no valid datapoints" found [19:46:58] not new, but also not changing [19:49:36] 10Ops-Access-Requests, 6operations: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901680 (10JMinor) Please let me or @BGerstle-WMF know what you need (if anything) for adding access. [19:50:49] 6operations, 6Services, 7Monitoring: various graphite based monitoring checks broken (memcached, parsoid, restbase, eventlogging..) - https://phabricator.wikimedia.org/T122332#1901681 (10Dzahn) 3NEW [19:51:19] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901690 (10Nuria) [19:51:38] 6operations, 6Services, 7Graphite, 7Icinga, 7Monitoring: various graphite based monitoring checks broken (memcached, parsoid, restbase, eventlogging..) - https://phabricator.wikimedia.org/T122332#1901693 (10Dzahn) [19:52:25] 6operations, 6Services, 7Graphite, 7Icinga, 7Monitoring: various graphite based monitoring checks broken (memcached, parsoid, restbase, eventlogging..) - https://phabricator.wikimedia.org/T122332#1901681 (10Dzahn) [19:54:49] 7Puppet, 6Analytics-Kanban, 5Patch-For-Review: Puppet support for multiple Dashiki instances running on one server [8 pts] - https://phabricator.wikimedia.org/T120891#1901703 (10Nuria) 5Open>3Resolved [19:55:25] Jeff_Green: barium is not in site.pp but is still in Icinga. it's FR right? should it be this way? anyways, i see it because it WARNs that there are 12 zombie processes there [19:56:09] yea, it's the fr civi, reporting there [19:59:57] !log restbase1004 - puppet stopped and host key changed, what's up? [20:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:13] ah filippo was working on it [20:01:16] mutante: godog reimaged 1004 recently; I assume this is related to that [20:02:01] gwicke: yes, he added a message to it, just found it [20:02:02] mutante: yep it's a fundraising host, not too shocking that it has zombie processes [20:02:07] looking though... [20:02:09] I'm not aware of a reason for keeping puppet disabled, but might be good to double-check with him [20:02:09] it says he was bootstrapping cassandra [20:02:29] yeah, that's done now [20:02:40] not as a bootstrap, but close enough [20:02:55] ok, maybe it could be re-enabled but i should ask [20:03:15] 6operations, 6Analytics-Engineering: kafka broker monitoring broken - https://phabricator.wikimedia.org/T122330#1901742 (10Ottomata) a:3Ottomata [20:03:37] mutante: yup, thanks!
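The "no valid datapoints" failures discussed above are what icinga reports when a graphite-backed check queries a target that returns no series at all (for example after a metric rename or a dead reporting pipeline), as opposed to a series with bad values. In ops/puppet such checks are declared with a graphite-backed define; a hedged sketch, with the define name as I recall it and purely illustrative parameters and metric path:

    # Sketch of a graphite-backed icinga check (values are placeholders).
    monitoring::graphite_threshold { 'eventlogging_insertion_rate':
        description => 'EventLogging overall insertion rate',
        # If this target matches no series, the check reports
        # "no valid datapoints" instead of OK/WARNING/CRITICAL.
        metric      => 'eventlogging.overall.inserted.rate',
        warning     => 100,
        critical    => 10,
        from        => '10min',
    }

A rename on the graphite side therefore breaks the check silently until someone notices the UNKNOWN state, which is one plausible reading of what T122332 is tracking.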
[20:08:25] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: Puppet has 1 failures [20:11:03] 6operations, 6Commons, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1901770 (10Dzahn) Sounds good, but like we'd still need an ACK from Mark because this actually costs money. [20:18:41] 6operations, 6Analytics-Backlog, 6Analytics-Engineering: kafka broker monitoring broken - https://phabricator.wikimedia.org/T122330#1901787 (10Nuria) [20:19:32] Dear chaosmonkey, the time has come. Please deploy random SWAT (Max 8 patches) [20:19:35] Reedy: ^ like that? [20:20:02] The chaosmonkey doesn't do schedules [20:20:20] :p right.. [20:20:39] but it should have a higher chance for Friday night and weekend deploys [20:20:51] Yeah [20:20:53] And overnight [20:20:57] code freeze = bonus [20:25:48] (03PS1) 10Ottomata: Add $group_prefix parameter to kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/260791 (https://phabricator.wikimedia.org/T122330) [20:26:45] (03CR) 10Ottomata: [C: 032] Add $group_prefix parameter to kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/260791 (https://phabricator.wikimedia.org/T122330) (owner: 10Ottomata) [20:27:40] (03PS1) 10Ottomata: Use $group_prefix for main kafka cluster checks [puppet] - 10https://gerrit.wikimedia.org/r/260792 (https://phabricator.wikimedia.org/T122330) [20:29:44] (03CR) 10Ottomata: [C: 032] Use $group_prefix for main kafka cluster checks [puppet] - 10https://gerrit.wikimedia.org/r/260792 (https://phabricator.wikimedia.org/T122330) (owner: 10Ottomata) [20:31:44] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:32:54] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail [20:36:34] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:38:11] 6operations, 6Analytics-Backlog, 6Analytics-Engineering, 5Patch-For-Review: kafka broker monitoring broken - https://phabricator.wikimedia.org/T122330#1901864 (10Ottomata) 5Open>3Resolved Thanks @dzahn! I saw those, but hadn't gotten a chance to look into them yet. This reminded me. Fixed! [20:39:28] 6operations, 6Analytics-Backlog, 6Analytics-Engineering, 5Patch-For-Review: kafka broker monitoring broken - https://phabricator.wikimedia.org/T122330#1901867 (10Dzahn) @ottomata wow, that was a really fast fix. thanks! [20:43:58] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901891 (10RobH) I'm on ops clinic duty this week, but since this is a sudo request, it will require the approval of an operations tea... [20:44:31] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1901893 (10RobH) Additionally, all of these users should have previously been setup with shell access. If any have not, they need to... [20:44:40] (03CR) 10Reedy: [C: 031] "None of these domains seem to resolve!"
[dns] - 10https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914) (owner: 10Dzahn) [20:51:38] (03PS1) 10Mobrovac: EventBus: add spec-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/260799 [20:51:40] 6operations, 10Traffic, 7Availability, 5MW-1.27-release-notes, 5Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1901925 (10aaron) a:5aaron>3None [20:52:09] (03PS3) 10Ori.livneh: Add piwik role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) [20:53:26] (03PS2) 10Dzahn: add several parked domains [dns] - 10https://gerrit.wikimedia.org/r/260706 (https://phabricator.wikimedia.org/T121914) [20:54:50] (03PS4) 10Ori.livneh: Add piwik role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) [20:54:55] (03CR) 10Ori.livneh: [C: 032 V: 032] Add piwik role [puppet] - 10https://gerrit.wikimedia.org/r/259601 (https://phabricator.wikimedia.org/T103577) (owner: 10Ori.livneh) [20:56:16] ori: ottomata: did you know about each other looking into piwik puppet? [20:56:30] nope! [20:56:37] heh , see the merge above [20:57:39] (03PS1) 10Ori.livneh: Fix-up for I136ab0a38 [puppet] - 10https://gerrit.wikimedia.org/r/260801 [20:57:53] ottomata: oh, d'oh [20:57:56] how far did you get? [20:58:08] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I136ab0a38 [puppet] - 10https://gerrit.wikimedia.org/r/260801 (owner: 10Ori.livneh) [20:58:22] heh! about as far, i had started puppetizing a module that set up local mariadb and apache vhost [20:58:49] with a similar comment about needing to run the installer and grant mysql perms manually [20:59:06] did you add piwik package to apt? [20:59:24] i like your role better though [20:59:30] I did [20:59:36] cool [20:59:36] :) [20:59:48] gonna mark nuria's ticket as a dupe [20:59:50] what is your mysql plan? [21:00:30] manual setup. seems like the only viable way. [21:00:33] ottomata: ohhhh [21:00:35] i started going down the path of: debconf::set { [ 'mysql-server/root_password', 'mysql-server/root_password_again' ]: value => $pass, notify => Exec['dpkg-reconfigure mysql-server'], } [21:00:38] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to 14.04 from 12.04 - https://phabricator.wikimedia.org/T122328#1901960 (10ssastry) [21:00:44] PROBLEM - puppet last run on bohrium is CRITICAL: Timeout while attempting connection [21:00:47] manual setup fine, but where are you going to have the db? [21:00:55] locally, per jynus's preference [21:01:01] might as well add a mariadb install on the server, no? 
or do you think elsewhere is better [21:01:04] aye, i would say locally too [21:02:43] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:05:21] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1901987 (10mobrovac) 3NEW [21:12:00] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1902019 (10Dzahn) a:3Dzahn [21:13:35] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1901987 (10Dzahn) I checked /mnt/jenkins-workspace/puppet-compiler/output# for especially large ones, and: root@compiler02:/mnt/jenkins... [21:14:12] ottomata: do you have password store set up locally? [21:14:17] have you added a password before? [21:14:31] haha, i have not since pwstore, i've only viewed [21:14:39] its on my TODOs [21:14:50] i had it working locally but then the yubikey setup messed it up somehow [21:14:51] i tried the other day and failed [21:14:54] yeah [21:15:03] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1902035 (10Dzahn) 5Open>3Resolved these really large ones must have been running on a LOT of nodes (maybe it was attempted to use th... [21:15:04] i put the credentials for piwik in iron:~otto/piwik-access.txt [21:15:16] everything like that is all messed up for me. i went for yubikey, el capitan, and 2 step verification all in the same week ! :( [21:15:24] ha ok [21:15:24] you should be able to compile again now [21:15:33] nuria: https://piwik.wikimedia.org/index.php [21:16:14] it has two layers of auth at the moment: wmf ldap (in apache via mod_authnz_ldap) and piwik's own [21:16:53] :) i did _not_ expect that to be up already, cool [21:17:44] 7Puppet, 6operations, 10Continuous-Integration-Infrastructure, 6Labs: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#1902074 (10Dzahn) [21:18:33] 6operations, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1902084 (10RobH) a:5RobH>3ArielGlenn So this is ready to roll live, however I don't want to do so without first checking with @ArielGlenn. Replacing the certificate often... [21:21:52] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1902096 (10RobH) a:5RobH>3jcrespo I don't think there would be an issue with simply replacing this certificate without a downtime window, but I'm not certain. Rather tha... [21:22:11] mutante: I never got to re-encrypting pwstore... [21:22:17] mutante: do you think I can convince you to do that? :) [21:22:34] ori: so did you just set that up in a local db manually? [21:23:30] ottomata: yes [21:23:33] create database piwik; create user 'piwik'@'localhost' identified by '...'; grant all privileges on piwik.* to 'piwik'@'localhost'; [21:23:38] then i went through the web installer [21:24:04] aye ok, i woulda puppetized the mysql server too, buuUuutu whatverrr :) [21:24:20] 6operations, 6Labs, 10netops: Create labs baremetal subnet?
- https://phabricator.wikimedia.org/T121237#1902111 (10Andrew) 5Open>3declined a:3Andrew This isn't needed. [21:28:34] ottomata: you still can [21:29:19] :) [21:29:39] naw, did you use mariadbwmf10 or whatever? (its fine, i thank you for doing it :) ) [21:31:12] no, mysql-server (which defaults to mysql-server-5.5 on jessie) [21:31:35] it is in the role [21:39:12] valhallasw`cloud: for the xtools-articleinfo being slow, maybe we should tell the new maintainer, ori :P [21:39:20] don't even try [21:41:50] ori: too late :P [21:46:46] (03PS1) 10Ori.livneh: Add some explanatory comments to Piwik role [puppet] - 10https://gerrit.wikimedia.org/r/260865 [21:46:59] ottomata: ^ [21:48:01] (03CR) 10Ori.livneh: [C: 032 V: 032] Add some explanatory comments to Piwik role [puppet] - 10https://gerrit.wikimedia.org/r/260865 (owner: 10Ori.livneh) [21:58:52] Yay Piwik <3 [22:02:20] 6operations, 6Analytics-Backlog, 6Security, 6Zero: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#1902263 (10csteipp) >>! In T92342#1127252, @kevinator wrote: > These logs are presently being used for QA tests on the new pageview definition. That QA should wra... [22:04:14] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1902277 (10akosiaris) This lacks some more information. What kind of access is required ? starting/stopping mysql ? apache? seeing logs... [22:04:54] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Ubuntu 14.04 from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1902284 (10ssastry) [22:15:35] 6operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1902306 (10Andrew) It is useful to have the labs vlan access tftp. I'm not entirely clear on whether it's safe to have it open to access though. [22:32:06] 6operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1902412 (10Krenair) [22:32:45] (03PS7) 10Mdann52: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) [22:41:24] RoanKattouw, could you do an earlier swat today? (or at least my patch, which seems to be the only one there anyway) [22:41:46] its a minor patch https://gerrit.wikimedia.org/r/#/c/260868/ [22:50:36] ostriches, Krenair ^ [22:51:10] its getting late for me (2am already) [22:51:29] and editors keep creating graphs without version [22:51:39] is that approved to go out despite the freeze? [22:51:57] 10Ops-Access-Requests, 6operations, 6Analytics-Backlog: add mforns, milimetric, nuria, ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1902508 (10Nuria) We will need to be able to see logs, do queries and start and stop the web application at least. [22:52:21] Krenair, no one objected yet - who should do it? [22:53:07] its a trivial change, and at the very worst, it will only break graph VE plugin which is not heavily used [22:53:25] today is being treated as a friday as well [22:53:59] or was it tomorrow [22:54:08] technically the swat happens after midnight, so... [22:54:41] Krenair, yeah, but now its before the midnight, right?
:)) [22:54:50] hmph [22:55:13] the problem is - every graph that will be created in the next week - i will have to go in and fix by hand [22:55:18] highly annoying process ) [22:55:48] (03PS1) 10Yuvipanda: url_downloader: Restrict to $INTERNAL only [puppet] - 10https://gerrit.wikimedia.org/r/260872 [22:56:12] yeah, but you'll think twice about what you push out next time :) [22:57:16] ori, ? that's not my code [22:57:39] yurik, to be honest it's not clear to me that this could be deployed during the normal time [22:57:52] trying to bring it forward doesn't hlp [22:57:53] help* [22:58:21] Krenair, what do you see as the problem with this patch? [22:58:41] (03PS2) 10Yuvipanda: Restrict url downloader and proxy to $INTERNAL only [puppet] - 10https://gerrit.wikimedia.org/r/260872 [22:59:32] yurik, I don't, but I don't think you have approval [22:59:52] Krenair, that was my original question - who is the "approval master" now [23:00:20] I sent a quick email to Chad & Greg [23:00:21] i don't think we ever had the official approval processes [23:00:40] greg-g might be here? [23:00:43] I think the official approval process was 'do not do it nooooo!' heh [23:00:43] Week of December 21st [23:00:43] No normal deploys [23:00:44] High Priority SWATs only [23:00:45] "High Priority" means security and data loss. Anything else needs prior approval from at least Katie (K4-713) or Chad Horohoe (ostriches). [23:04:11] fwiw Roan is on vacation right now [23:04:32] so is greg [23:04:37] and I think ostriches might be [23:05:11] yes he is [23:05:40] Would've been nice if they could've removed themselves... I'd have cancelled the swat [23:12:22] I never remove myself from swat [23:12:35] I get pinged for it every day lmao [23:13:02] yurik: Krenair: you can always do that on-wiki in common.js. :P [23:13:21] (it would work for this patch in particular) [23:13:22] MatmaRex, in every wiki? [23:13:40] yurik: no, in the ones that actually use graphs. presumably not all do. [23:13:46] all :) [23:14:01] (i mean, that have users who insert the graphs into articles, not just theoretically able to use graphs) [23:14:02] every VE editor has this feature [23:14:10] ostriches, I sent you an email [23:15:01] MatmaRex, people experiment with it - that's why it keeps appearing in weird places, and i have to catch them :( [23:15:18] i just went through 30,000 graphs and updated them all to use proper version [23:15:23] I saw. [23:18:14] Cookie for someone who drops the wikitech deployments page link in here...on mah phone. [23:18:35] https://wikitech.wikimedia.org/wiki/Deployments [23:18:43] * ostriches owes Krenair a cookie now [23:20:19] ostriches, so what do you think? [23:20:35] * yurik gives ostriches a cookie [23:20:47] Who's doing the deploy for you if I approve? I can understand the confusion about Wednesday/Thursday...timezones suck [23:21:14] * yurik wonders if ostriches eat cookies [23:21:20] ostriches, i could even do it myself [23:21:30] or Krenair i guess ) [23:21:37] as being the only one in the SWAT list ) [23:21:53] who is still around [23:21:56] and not you [23:22:15] * ostriches is definitely not here lolol [23:23:53] yurik: approved [23:24:08] ostriches, thx! Krenair want to do it?
[23:24:13] ok [23:24:16] thx ) [23:24:22] After that everyone should detach and go stuff their faces with holiday pies and such :D [23:24:39] ostriches, its 3am ;) [23:24:51] Anytime is pie time [23:25:00] https://nl.wikipedia.org/wiki/Overleg_gebruiker:Ad_Huikeshoven/Taartgrafiek [23:25:05] that's what i just got [23:25:19] as part of this discussion - https://phabricator.wikimedia.org/T14603 [23:25:25] so i guess you are right :))) [23:25:29] It's going through Jenkins now yurik [23:25:35] ostriches, go have fun, I've got this [23:25:50] thanks for popping in [23:25:57] yes, thank you both! [23:29:12] Fun heg [23:29:16] *heh [23:31:44] !log krenair@tin Synchronized php-1.27.0-wmf.9/extensions/Graph/modules/ve-graph: https://gerrit.wikimedia.org/r/#/c/260868/ (duration: 00m 31s) [23:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:53] yurik, ^ [23:32:06] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [23:32:07] Krenair, thx! checking... [23:32:57] Krenair, awesome, works. thx! [23:33:11] great [23:40:41] (03CR) 10Dzahn: [C: 04-1] "$INTERNAL would not cover a server with public IP in wikimedia.org or labs systems. This is used by everything that uses APT afaik" [puppet] - 10https://gerrit.wikimedia.org/r/260872 (owner: 10Yuvipanda) [23:56:24] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
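For context on Dzahn's -1 at the end: the patch under review restricts the url-downloader with a ferm rule whose source range is $INTERNAL, and the objection is that $INTERNAL only matches the private production subnets, so wikimedia.org hosts with public IPs and labs instances (both legitimate clients, e.g. for APT) would be cut off. A sketch of the kind of rule in question; the port and resource title are illustrative, not the literal patch:

    # Roughly what https://gerrit.wikimedia.org/r/260872 proposes (sketch only).
    ferm::service { 'url_downloader':
        proto  => 'tcp',
        port   => '8080',          # assumed listener port
        # The review objection: $INTERNAL expands to the private subnets
        # only, so a broader srange (or explicit extra networks) would be
        # needed to keep public-IP hosts and labs clients working.
        srange => '$INTERNAL',
    }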