[00:00:15] RoanKattouw: whee that was fast! K we'll be checking... [00:07:44] (03PS1) 10BBlack: text VCL: protect mobile cache from text pollution [puppet] - 10https://gerrit.wikimedia.org/r/258648 (https://phabricator.wikimedia.org/T109286) [00:08:13] (03CR) 10BBlack: [C: 032 V: 032] text VCL: protect mobile cache from text pollution [puppet] - 10https://gerrit.wikimedia.org/r/258648 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [00:08:49] RoanKattouw: K the new code is live, looks fine so far [00:17:40] RoanKattouw_away: all good! thanks so much again :) [00:17:48] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server [00:20:46] hm [00:20:48] that's ldap-labs.eqiad.wikimedia.org [00:21:42] and labs login just broke [00:21:52] Coren, YuviPanda, andrewbogott: ^ [00:29:20] !log restarting slapd on seaborgium with manual hack [00:29:28] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.009 seconds response time [00:32:22] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [00:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:41] paravoid: was it crashed or stuck? [00:34:12] (03PS1) 10Faidon Liambotis: openldap: add a 10-minute idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/258652 [00:34:19] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [00:34:21] andrewbogott: it had opened too many files [00:34:28] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:34:42] mutante: slapd? interesting... [00:34:42] did noone add the diamond collector yet? [00:34:54] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874796 (10JGirault) Works, I just logged in :) [00:35:31] I don’t know. Coren, YuviPanda, was one of you writing the diamond bits for ldap? [00:44:53] (03CR) 10Dzahn: [C: 04-1] "Could not find template 'maps/grants.sql.erb' at /mnt/jenkins-workspace/puppet-compiler/1477/change/src/manifests/role/maps.pp:77 on node " [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [00:47:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [01:11:04] andrewbogott: Coren said he almost had it done? [01:11:19] YuviPanda: yeah, that sounds familiar [01:11:22] I'll do it [01:13:31] * YuviPanda reads backscroll [01:13:37] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [01:16:25] ^ just as that fired, a user in #labs says that his session is lagging. [01:16:25] I don’t understand how that’s related, I’ve never seen that alert before [01:16:26] it's unrelated but maybe similar cause since they all hit NFS [01:16:26] which hits LDAP [01:16:26] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:16:26] so LDAP slows, NFS slows, knowk on effects [01:16:39] right, but the ldap alert isn’t firing [01:16:48] it just needs to be slow enough, I guess [01:16:56] I restarted slapd a couple of times [01:17:09] but... 
yeah, this isn't supposed to happen on just a simple restart [01:17:33] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 957096 bytes in 3.027 second response time [01:17:37] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [01:17:52] yeah, everything is back now [01:18:01] I wonder if this is nscd on labstore1001 doing something fun [01:18:44] (03PS1) 10Faidon Liambotis: openldap: add diamond collector for grabbing stats [puppet] - 10https://gerrit.wikimedia.org/r/258657 [01:19:09] (03CR) 10Faidon Liambotis: [C: 032] openldap: add a 10-minute idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/258652 (owner: 10Faidon Liambotis) [01:19:40] (03CR) 10Faidon Liambotis: [C: 032] openldap: add diamond collector for grabbing stats [puppet] - 10https://gerrit.wikimedia.org/r/258657 (owner: 10Faidon Liambotis) [01:20:13] I'll leave in a bit -- if thinks get funky, I'd start by reverting the idletimeout patch [01:20:20] the monitor patch, I don't expect to have much effect [01:20:33] I’d hope not :) [01:21:46] should we make the ldap check paging? [01:21:51] I guess the tools-home check pages [01:22:06] I'll restart slapd one finale time I'm afraid [01:22:27] final* [01:25:32] https://graphite.wikimedia.org/render/?width=586&height=308&target=servers.seaborgium.openldap.conns.current [01:25:32] are you not getting that page, YuviPanda ? [01:25:32] which page, mutante [01:25:33] ldap on seaborgium? [01:25:33] tools-home [01:25:33] mutante: yes, I was talking about the direct ldap check [01:25:34] ok, you phrased that in that way that sounded like it might not work [01:25:35] and we were wondering about the gateway [01:27:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [01:28:11] why does that fail so often? [01:28:45] I think it just hits LDAP [01:28:50] it's been a while let me look [01:29:07] does it crash when LDAP disconnects? [01:29:08] it shouldn't [01:29:10] Dec 12 01:22:43 labstore1001 create-dbusers[12939]: ldap3.core.exceptions.LDAPSessionTerminatedByServer: session terminated by server [01:29:11] it should reconnect [01:29:12] yeah [01:29:16] let me fix that [01:29:21] <3 [01:30:27] (03PS4) 10Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 [01:31:27] paravoid: it's also holding one connection open forever. let me make it close it at the end of each run, and also space out the runs more [01:32:01] (03CR) 10Dzahn: "@akosiaris the templates were in "./manifests/templates/" inside the role module. it did not find them there.. moved them to ./templates/m" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:32:03] that's not necessarily a problem [01:32:23] in fact it will probably save some resources [01:32:34] for frequent readers/writes keeping the connection open is a good idea [01:32:39] (03CR) 10Alex Monk: "Added some missing newlines" [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [01:32:47] saves you from establishing a new connection every time (TCP, TLS negotiation etc.) 
[01:33:15] otoh an open connection does consume finite resources, so if you're only doing a query every hour, it's probably better to connect/query/close [01:33:20] it's a tradeoff essentially [01:34:52] (03PS1) 10Yuvipanda: labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 [01:35:08] (03CR) 10Dzahn: "Error: Role class role::maps not found at /mnt/jenkins-workspace/puppet-compiler/1478/change/src/manifests/site.pp:1746 on node maps-test2" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:35:10] paravoid: hmm, so ^ [01:36:28] (03PS5) 10Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 [01:36:34] paravoid: let me know what you think of that patch. I can also make it hit LDAP every 5 mins instead of 2min [01:36:36] nbd [01:38:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874852 (10Dzahn) >>! In T118998#1874796, @JGirault wrote: > Works, I just logged in :)... [01:38:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874855 (10Dzahn) 5Open>3Resolved [01:38:59] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815162 (10Dzahn) [01:39:27] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874859 (10Dzahn) a:5ArielGlenn>3Dzahn [01:41:59] 6operations, 6Labs, 10netops, 5Patch-For-Review: Create labs baremetal subnet? - https://phabricator.wikimedia.org/T121237#1874862 (10Dzahn) [01:45:42] (03CR) 10Dzahn: [C: 04-2] "http://puppet-compiler.wmflabs.org/1479/" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:47:07] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [02:01:32] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874887 (10Dzahn) @Lydia_Pintscher i checked our mail logs, and i can confirm there are 4 emails that have: - been sent to lydia.pintscher@wikimedia.org - on December 10 - origin is wikidata wiki... [02:08:35] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874898 (10Dzahn) @Hoo i can't confirm an issue on our side. i also see mail delivered to you on that day, but @wikimedia.de and both are H=mxlb.ispgateway.de which other account (gmail vs. 1and1)... [02:13:28] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874912 (10Dzahn) >>! In T119404#1825404, @Krenair wrote: > statistics-users is not necessary, researchers is the right grou... 
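An aside on the create-dbusers thread above (not part of the log): the fix YuviPanda describes — open a fresh LDAP connection per run, and treat a server-side hangup as a signal to reconnect rather than crash — would look roughly like the sketch below with the ldap3 library. The server URI, base DN and query here are placeholders, not the actual labstore1001 script.

```python
# Hypothetical sketch only -- not the real create-dbusers code. The idea is to
# open a fresh connection per run (so the new 10-minute idle timeout or a slapd
# restart can't leave a dead session around) and to retry once on a hangup
# instead of letting the whole unit fail.
import ldap3
from ldap3.core.exceptions import LDAPSessionTerminatedByServer, LDAPSocketOpenError

LDAP_URI = "ldap://ldap-labs.eqiad.wikimedia.org"  # placeholder
BASE_DN = "ou=people,dc=wikimedia,dc=org"          # placeholder

def fetch_accounts():
    conn = ldap3.Connection(ldap3.Server(LDAP_URI), auto_bind=True)
    try:
        conn.search(BASE_DN, "(objectClass=posixAccount)",
                    attributes=["uid", "uidNumber"])
        return [e["attributes"] for e in conn.response if "attributes" in e]
    finally:
        conn.unbind()  # always close; the next run starts from scratch

def run_once():
    try:
        return fetch_accounts()
    except (LDAPSessionTerminatedByServer, LDAPSocketOpenError):
        # "session terminated by server" is what showed up in the labstore1001
        # journal above; reconnect once with a brand-new connection.
        return fetch_accounts()
```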
[02:16:12] (03PS3) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:17:32] (03CR) 10jenkins-bot: [V: 04-1] add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:17:57] (03PS4) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:18:37] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874918 (10Krenair) researchers gives access to both stat1003 itself and the mysql credentials. I have the same thing myself... [02:19:38] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874919 (10GWicke) [02:20:05] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001281, end_log_pos 721053864 [02:20:44] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [02:21:04] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874921 (10Dzahn) Thanks. Then the researchers group has been changed at some point to include stat1003. Moving and and merg... [02:21:51] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 08m 31s) [02:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:23:57] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874922 (10GWicke) [02:28:25] (03PS5) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:28:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 12 02:28:44 UTC 2015 (duration 6m 53s) [02:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:59] (03CR) 10Dzahn: [C: 032] add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:39:01] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874943 (10Dzahn) 5Open>3Resolved [stat1003:~] $ id matmarex uid=2501(matmarex) gid=500(wikidev) groups=500(wikidev),714... 
[02:39:17] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874945 (10Dzahn) [02:41:31] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874947 (10BBlack) The task is vague, can you give specific example scenarios or something? I'm assuming services would run as regular users and not have root... [02:51:51] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874950 (10Dzahn) @matmarex your user exists on stat1003.eqiad.wmnet now the mysql credentials are here, i confirmed i can read them as your use... [02:56:09] (03CR) 10Andrew Bogott: [C: 031] labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 (owner: 10Yuvipanda) [02:58:29] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: puppet fail [03:02:15] !log ran fixDefaultJsonContentPages.php --wiki=thwiktionary for T108663 [03:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:35] (03PS1) 10coren: Reorder modules in common-account [puppet] - 10https://gerrit.wikimedia.org/r/258663 [03:02:48] YuviPanda: andrewbogott: That one && [03:03:22] (03CR) 10Yuvipanda: [C: 031] Reorder modules in common-account [puppet] - 10https://gerrit.wikimedia.org/r/258663 (owner: 10coren) [03:04:33] (03CR) 10coren: [C: 032] "Simple fix." [puppet] - 10https://gerrit.wikimedia.org/r/258663 (owner: 10coren) [03:06:04] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874974 (10GWicke) > The task is vague, can you give specific example scenarios or something? Anybody controlling one of these services has at least code execu... [03:15:27] g’night all! [03:16:12] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874977 (10hoo) >>! In T121105#1874898, @Dzahn wrote: > @Hoo i can't confirm an issue on our side. i also see mail delivered to you on that day, but @wikimedia.de and both are H=mxlb.ispgateway.de... [03:21:37] good night andrewbogott [03:26:07] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:26:52] I'm getting intermittent 429's... [03:30:15] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874987 (10Krenair) Do we actually care whether a service is developed by volunteers or not? [03:31:42] Josve05a_night, are you sending lots of requests? [03:33:06] not that I know of [03:33:18] seems to be good now though... [03:37:19] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 2 failures [03:39:02] (03PS4) 10Madhuvishy: [WIP] apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 [03:54:01] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874992 (10GWicke) > Do we actually care whether a service is developed by volunteers or not? @Krenair: I agree with you that expertise and responsiveness of m... 
[04:12:19] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Puppet has 1 failures [04:37:58] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:40:40] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1875004 (10JanZerebecki) >>! In T111654#1844080, @jcrespo wrote: > Of all ciphers, only a few work: Nearly all of the not working ones are EC based. Either needing an ECDSA key (you likely d... [05:04:18] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail [05:12:19] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [05:26:38] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [05:29:39] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [05:32:08] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [05:37:32] (03PS1) 10Revi: Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [05:37:57] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:52:09] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:57:37] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:08:59] (03CR) 10Glaisher: Noindex for User namespace in kowiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [06:11:29] (03PS2) 10Revi: Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [06:19:57] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 105668 MB (3% inode=99%) [06:20:10] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001281, end_log_pos 721053864 [06:21:32] (03CR) 10Glaisher: [C: 031] Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [06:21:57] RECOVERY - Disk space on restbase1004 is OK: DISK OK [06:25:29] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: / 630 MB (2% inode=95%) [06:27:18] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:27:28] RECOVERY - Disk space on elastic1008 is OK: DISK OK [06:30:28] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:30:28] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:47] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:30:58] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:08] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:18] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [06:31:28] PROBLEM - 
puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:38] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:28] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:48] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:28] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:57:58] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:07] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:58:38] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:39] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: Puppet has 1 failures [08:02:17] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:15:26] (03PS1) 10Dereckson: Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) [08:22:39] (03PS3) 10Dereckson: Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:22:47] (03CR) 10Dereckson: [C: 031] Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:23:46] (03PS4) 10Revi: Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [08:24:06] (03CR) 10Revi: "Fixes ~~~~ is to trigger autoclose of phabricator." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:24:27] duh, pingspam [08:25:20] and T121301 is (strictly speaking) unrelated to T92798 [08:25:29] completely different req [08:27:04] revi: the two are about configure namespaces [08:27:34] yes but you don't add see also to all previous related bugs [08:27:59] revi: it allows to track the similar requests (any namespace-related confg change for a wiki), like on the "see also" field on Bugzilla. [08:28:33] If you want to track similar req, I think project or tracking bug. [08:29:17] revi: We try to avoid to sue tracking bugs on Phabricator. A project namespaces-for-kowiki seems a little expensive to link two or three requests. [08:29:21] I wanted to set tracking bug for kowiki but I didn't know how to make it in bz and after phab... I'm waiting for some bug (I forgot teh no.) about policy for per-wiki/per-language project [08:29:41] well [08:29:59] I have at least 3 or 5 for ns stuff of kowiki [08:30:02] (bugs) [08:30:03] Oh okay. Some other local community created a tracking bug for this use, and add any new bug as blocing the former [08:30:22] see project #commons, #wikisource [08:30:29] for example [08:31:04] I imagine the project is a good idea when a project board is needed. [08:31:23] yeah but that policy is under discussion for a year and more [08:33:15] and to counterexample of '17:29:17 revi: We try to avoid to sue tracking bugs on Phabricator. A project namespaces-for-kowiki seems a little expensive to link two or three requests.' T57342, T57914, T87528 [08:33:27] and more kowiki requests... anyway. [08:34:37] (03PS1) 10Dereckson: Set site name on sr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258670 (https://phabricator.wikimedia.org/T121278) [08:39:42] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1875098 (10tomasz) Removing myself and adding the other Tomasz instead :-) [09:01:59] !log move old elasticsearch logs on elastic1012 out to /var/lib/elasticsearch/log (/ is full) [09:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:08] RECOVERY - Disk space on elastic1012 is OK: DISK OK [09:06:51] !log move old elasticsearch logs on elastic1016 out to /var/lib/elasticsearch/log (/ is full) [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:35] mutante|away: you always use ganeti01.svc.$::site.wmnet [09:07:50] it will always point you to the correct master [09:08:16] (03PS1) 10Dereckson: Enable Wikilove on az.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) [09:08:41] (03CR) 10Dereckson: [C: 04-1] "Blocked on community consensus." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) (owner: 10Dereckson) [09:08:49] RECOVERY - Disk space on elastic1016 is OK: DISK OK [09:12:54] (03PS1) 10Dereckson: Enable NewUserMessage on ps.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258672 (https://phabricator.wikimedia.org/T121132) [09:13:42] PROBLEM - MariaDB Slave Lag: s1 on db2016 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 24953 [09:14:12] RECOVERY - MariaDB Slave SQL: s1 on db2016 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:14:16] heh [09:14:22] wat ? [09:14:33] how did it get fixed ? 
[09:14:38] I fixed it [09:14:43] <_joe_> akosiaris: it's a large query [09:15:01] !log move old elasticsearch logs on elastic1026 out to /var/lib/elasticsearch/log (/ is full) [09:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:08] <_joe_> the "seconds behind master" is a calculation made by mysql, it's not a real number [09:15:14] _joe_: I don't think so... it said slave sql runing: no [09:15:19] <_joe_> godog: we should make logrotate more aggressive? [09:15:20] no, it is something writing on the ops database, that broke replication [09:15:25] <_joe_> slave_sql_lag Seconds_Behind_Master: 24953 [09:15:27] _joe_: that ^ [09:15:35] the lag is real [09:15:50] <_joe_> jynus: the ops database? [09:15:54] _joe_: it doesn't mean it will take 24953 seconds to catch up [09:16:02] <_joe_> yup exactly [09:16:07] _joe_, you know the same I do [09:16:09] :-) [09:16:09] _joe_: not sure yet what's the right solution, it is the current indexing slowlog that's big [09:16:13] just that since the last time it successully replayed something 24953 sec have passed [09:16:21] <_joe_> godog: ah, known problem [09:16:39] RECOVERY - Disk space on elastic1026 is OK: DISK OK [09:16:39] jynus: ok, so I assume some weird query ? [09:16:41] <_joe_> godog: I think david had a ticket tracking the upstream issue [09:16:43] yes, it should take less than that, that is the current delay [09:16:50] akosiaris, I have to investigate an fix it [09:17:21] ok. what was the temporary fix ? sql slave skip counter =1 ? [09:17:28] <_joe_> it depends [09:17:30] in this case [09:17:45] where there was a write to a non existent table on the slave [09:17:52] lol [09:17:53] and a table that is not useful [09:18:03] yes, skip was ok [09:18:10] normally that is not ok [09:18:10] good to know it was what I would have done [09:18:19] do not do it generally [09:18:21] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 23745 [09:18:27] that breaks replication even more [09:18:27] oh I know better than that [09:18:39] <_joe_> jynus: how do you know that table is not useful? [09:18:40] I usually look very carefully at the query before doing that [09:18:53] so, db2034 is the exact same thing I assume ? [09:19:09] I'm ackin all those [09:19:26] when it broke on db2016, it broke on all of codfw [09:19:43] _joe_, the ops database is not real data, AFAIK [09:19:53] <_joe_> lol :) [09:20:01] it is supposed to be a local database in the past for query profiling [09:20:16] the fact that it is being replicated is a flaw in the logic [09:20:22] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:20:30] _joe_: heh, do you know offhand how to search (ah ah) for it? [09:20:31] and I have to found why it was created [09:22:43] <_joe_> godog: nope [09:22:46] <_joe_> godog: I can try [09:24:51] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is failing due to the role:: namespace being handled both by the module and the import "manifests/role/*.pp" statement if this patch " [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [09:25:15] <_joe_> godog: https://phabricator.wikimedia.org/T117181 [09:25:41] this is an important bug, that would have broken the wikis, if it had been eqiad [09:26:42] <_joe_> akosiaris: oh so the original phab issue on this is wrong? [09:27:16] _joe_: sorry ? not following ... [09:27:47] <_joe_> the Ps you just commented on [09:27:57] there's a phab for that ? 
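For context on the indexing slowlog filling / on the elastic hosts (not part of the log): the workaround godog !logs a little further down — disabling the TRACE indexing slowlog for urwiki_content — is a dynamic per-index settings change. A hypothetical sketch of that kind of call in Python with requests; the setting name is taken from the Elasticsearch documentation of that era and is an assumption to verify against the cluster's actual version.

```python
# Hypothetical illustration only: raise the TRACE indexing-slowlog threshold to
# -1 (i.e. disable it) for a single index via the dynamic _settings API.
import json
import requests

ES = "http://localhost:9200"   # placeholder: any node in the eqiad cluster
INDEX = "urwiki_content"

resp = requests.put(
    "{}/{}/_settings".format(ES, INDEX),
    headers={"Content-Type": "application/json"},
    data=json.dumps({"index.indexing.slowlog.threshold.index.trace": "-1"}),
)
resp.raise_for_status()
```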
[09:28:39] <_joe_> yup, gimme a sec [09:28:46] _joe_: ah thanks! I'll piggyback on that [09:31:48] 6operations, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#1875146 (10fgiunchedi) [09:32:58] so, Connection: Close doesn't seem to imply anything to etherpad ... [09:33:43] 6operations, 6Commons, 10Wikimedia-Media-storage: image magick stripping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#1875147 (10Nemo_bis) [09:36:52] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17328 [09:36:56] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17328 [09:37:42] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17007 [09:40:55] (03CR) 10Aklapper: ""not ready for merge yet, a bug to be worked out" - if that is still true, feel encouraged to add a [WIP] prefix to the patch summary so t" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [09:42:39] PROBLEM - puppet last run on mw2149 is CRITICAL: CRITICAL: puppet fail [09:43:08] hi [09:43:10] more LDAP woes [09:44:03] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 14564 [09:44:39] !log recreated events on db1057 with sql_bin_log = 0 and restarted replication on db2016 [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:45:22] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:45:26] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:45:28] so here is the thing [09:46:12] RECOVERY - MariaDB Slave Lag: s1 on db2069 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:46:21] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:46:38] events, that may be actually useful, but I would prefer to have on an agent and puppetized will kill long-running queries [09:46:47] and idling connections [09:46:58] that creates a log to a table [09:47:03] and that table is purged [09:47:09] but that was replicated [09:47:49] meaning that at some point table was purged, but that made replication explode [09:48:12] I recereated the events so that they do not write to the slave's binary log [09:48:27] but all servers should be checked for these events [09:48:48] (03CR) 10Filippo Giunchedi: "LGTM, do you have this running already somewhere to peek at a sample of metrics?" 
[puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [09:49:42] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001282, end_log_pos 300174534 [09:50:32] I wil have to skip, however, errors like ^ created hours ago [09:51:07] do not skip errors on tables outside of the ops database, which is not very useful [09:52:56] better that we caught this now than before it was on full production [10:00:59] !log elastic in eqiad: disabling TRACE indexing slowlog for urwiki_content [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:02:22] RECOVERY - MariaDB Slave SQL: s1 on db2016 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:10:08] RECOVERY - puppet last run on mw2149 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:12:35] !log bounce nslcd on tools-submit and stop puppet [10:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:00] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219845 (owner: 10ArielGlenn) [10:15:05] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 (owner: 10ArielGlenn) [10:15:10] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219841 (https://phabricator.wikimedia.org/T103013) (owner: 10ArielGlenn) [10:32:59] 6operations, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#1875217 (10dcausse) Thanks, Joe submitted a puppet patch few month ago but we haven't restarted the nodes yet. The workaround today is to disable... [10:47:34] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Logging: Error: 2013 Lost connection to MySQL server during query on IndexPager::buildQueryInfo (LogPager) - https://phabricator.wikimedia.org/T121306#1875234 (10Josve05a) [10:48:42] RECOVERY - MariaDB Slave Lag: s1 on db2016 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [10:49:47] that is the end of the issue, it will take hours to confirm that it has been fixed permanently [10:51:10] thanks [10:52:48] these pages are a bit too verbose [10:53:11] a page for each codfw slave that's lagging behind is probably too much :) [10:57:26] well, it is not trivial- if the master lags, all lag, but they can lag individually to [10:57:35] problem is that topology is dynamic [10:57:42] and not puppet-dependent [10:57:57] so it is not trivial to create nagios dependencies [10:59:09] I suppose a check could be created where not become critical if its master is critical, but again that is not dynamic [11:00:04] BTW, I have not received a single SMS [11:00:11] I got them all.. [11:00:44] I woke up like 4 times :P [11:01:09] paravoid sleeps? 
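A note on the replication thread above (not part of the log): the distinction _joe_ and jynus draw — Slave_SQL_Running tells you whether replication is actually broken, while Seconds_Behind_Master is only the age of the last applied event, not a catch-up estimate — is exactly what the Icinga checks read out of SHOW SLAVE STATUS. A minimal, hypothetical sketch with PyMySQL; host, user and thresholds are placeholders.

```python
# Hypothetical sketch of what a slave-health check inspects. Skipping events
# (sql_slave_skip_counter) is deliberately NOT automated here: as noted above,
# it is only safe when the failing statement is known to be harmless.
import pymysql

def slave_status(host):
    conn = pymysql.connect(host=host, user="check", password="...",  # placeholders
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone()
    finally:
        conn.close()

status = slave_status("db2016.codfw.wmnet")
if status["Slave_SQL_Running"] != "Yes":
    print("CRITICAL:", status["Last_SQL_Error"])        # replication broken
elif (status["Seconds_Behind_Master"] or 0) > 300:
    # age of the last applied event, not time needed to catch up
    print("WARNING: lag", status["Seconds_Behind_Master"], "s")
```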
[11:01:13] TIL [11:03:36] ldap connections are holding steady, btw: https://graphite.wikimedia.org/render/?width=586&height=308&target=servers.serpens.openldap.conns.current&target=servers.seaborgium.openldap.conns.current [11:05:21] paravoid: lots of errors from nslcd in syslogs tho [11:16:56] I think the best way is to not page if a server is depooled [11:17:30] but that requires to pull from git on every check [11:19:31] but it would be nice to integrate it with systemd - "this server is pooled, are you sure you want to stop it?" [13:26:37] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: puppet fail [13:52:09] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:57:35] (03CR) 10Luke081515: [C: 031] Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [13:58:18] Can a member of the SWAT Team look for that patch? He needs deployment till tomorow [14:05:45] SWAT on that level isn't really ops territory :) [14:44:48] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail [14:57:16] ori: You have done a change in syntax highlight this month: https://gerrit.wikimedia.org/r/#/c/256577/ [14:57:45] After setting up a wiki with this change, I get the error „Class undefined: Symfony\Component\Process\ProcessBuilder” [15:12:57] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:18:03] Run composer [15:22:10] Run, Composer, run! [16:00:35] (03PS3) 10Alexandros Kosiaris: Remove empty manifests/role/openldap.pp [puppet] - 10https://gerrit.wikimedia.org/r/258486 [16:00:57] (03CR) 10Alexandros Kosiaris: [V: 032] Remove empty manifests/role/openldap.pp [puppet] - 10https://gerrit.wikimedia.org/r/258486 (owner: 10Alexandros Kosiaris) [16:11:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:15:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:23:43] (03PS2) 10Alexandros Kosiaris: varnish: allow etherpad to use websockets [puppet] - 10https://gerrit.wikimedia.org/r/258439 [16:53:39] PROBLEM - puppet last run on db2001 is CRITICAL: CRITICAL: puppet fail [16:59:34] (03CR) 10Alex Monk: [C: 032] Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [16:59:57] (03Merged) 10jenkins-bot: Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [17:01:13] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/258669/ (duration: 00m 29s) [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:19] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:20:25] (03CR) 10Florianschmidtwelzow: [C: 031] Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [17:56:38] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail [18:10:48] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [18:24:17] RECOVERY - 
puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:38:28] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:11:48] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [19:30:38] !log Restarted Zuul [19:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:28] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:20] (03PS1) 10Ori.livneh: Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) [20:12:52] (03PS2) 10Ori.livneh: Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) [20:13:00] (03CR) 10Ori.livneh: [C: 032 V: 032] Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) (owner: 10Ori.livneh) [20:19:44] (03PS1) 10Ori.livneh: Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 [20:19:58] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail [20:21:04] ori I like your puppet magic: <| |> [20:21:35] (03CR) 10jenkins-bot: [V: 04-1] Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 (owner: 10Ori.livneh) [20:22:34] i'm actually a bit mystified [20:23:41] that diamond operator is probably way over my league [20:23:55] I will get another glass of Bourgogne instead [20:24:10] it's not a diamond, it's a spaceship! :) [20:24:11] https://docs.puppetlabs.com/puppet/latest/reference/lang_collectors.html [20:24:16] but that sounds like a better plan [20:24:49] (03PS2) 10Ori.livneh: Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 [20:25:25] I wonder whether that would work on labs [20:25:32] I don't think puppet has collection enabled on labs [20:25:49] maybe puppetmaster::self does though [20:25:57] (03CR) 10Ori.livneh: [C: 032] Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 (owner: 10Ori.livneh) [20:27:15] i don't think it's possible to disable resource collectors [20:27:55] at least deployment-redis{01,02} are passing puppet [20:28:21] yeah i'm just checking on deployment-redis02 [20:30:28] * hashar refills with more Bourgogne [20:33:48] (03PS1) 10Ori.livneh: redis: explicitly declare 'daemonize no' for each instance [puppet] - 10https://gerrit.wikimedia.org/r/258693 [20:35:44] (03CR) 10Ori.livneh: [C: 032] redis: explicitly declare 'daemonize no' for each instance [puppet] - 10https://gerrit.wikimedia.org/r/258693 (owner: 10Ori.livneh) [20:43:08] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: http status 500 [20:43:18] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: http status 500 [20:43:54] doh [20:44:05] (03PS1) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [20:44:10] ori: could it have impacted ocg ? 
[20:44:44] i'll check [20:45:08] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 147 msg: ocg_render_job_queue 0 msg [20:45:17] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 170 msg: ocg_render_job_queue 0 msg [20:45:50] it looks like ocg crashes if its redis server is unavailable even briefly [20:45:56] it then recovers [20:45:58] still pretty shitty [20:46:09] anyways, fine now, didn't need to intervene [20:46:24] we had a spike of errors [20:46:43] even on wikis with stuff like "Could not insert 1 enqueue job(s)." [20:46:46] (03PS1) 10Ori.livneh: Add rdb100[5-6] to job runner configuration [puppet] - 10https://gerrit.wikimedia.org/r/258696 (https://phabricator.wikimedia.org/T119543) [20:46:49] guess MediaWiki doesn't like it either [20:47:38] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:47:54] * ori doesn't see a spike on https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [20:48:49] looks like it was on jobs [20:48:53] (03CR) 10Ori.livneh: [C: 032] Add rdb100[5-6] to job runner configuration [puppet] - 10https://gerrit.wikimedia.org/r/258696 (https://phabricator.wikimedia.org/T119543) (owner: 10Ori.livneh) [20:48:57] guess they will just be retried later on [20:50:24] (03PS1) 10Ori.livneh: Fix-up for I84fe3a2638 [puppet] - 10https://gerrit.wikimedia.org/r/258697 [20:50:35] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I84fe3a2638 [puppet] - 10https://gerrit.wikimedia.org/r/258697 (owner: 10Ori.livneh) [20:52:51] (03PS2) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [20:54:33] (03PS3) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [21:00:06] (03CR) 10Ori.livneh: [C: 032] Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 (owner: 10Ori.livneh) [21:00:39] (03Merged) 10jenkins-bot: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 (owner: 10Ori.livneh) [21:02:57] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I53f13a159: Add rdb100[5-6] to job queue configuration (duration: 00m 31s) [21:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:44] 6operations, 10ops-eqiad, 5Patch-For-Review: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1875793 (10ori) 5Open>3Resolved [21:06:29] have a good afternoon! [21:09:28] Hi [21:09:45] Purging isnt working - https://en.wikisource.org/w/index.php?title=Index:A_Voyage_in_Space_%281913%29.djvu&redirects=1 [21:10:02] And there a pages shown in red that should be in Yellow [21:10:08] I'm starting to get this after various edits and admin actions: [6067fa51] 2015-12-12 21:07:35: Fatal exception of type "MWException" [21:10:09] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Puppet has 2 failures [21:10:24] which suggests page status and link table aren't updating. 
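On the OCG blip above (not part of the log): a client that returns 500s the moment Redis restarts, or while Redis is still "LOADING ... the dataset in memory", can usually be hardened with a small retry wrapper. A hypothetical sketch with the Python redis client; the host and the wrapped call are placeholders, not OCG's actual code.

```python
# Hypothetical retry wrapper: ride out a brief Redis restart (connection
# refused) or the post-restart "LOADING Redis is loading the dataset in memory"
# window instead of failing the request outright.
import time
import redis
from redis.exceptions import BusyLoadingError, ConnectionError as RedisConnectionError

client = redis.StrictRedis(host="rdb1005.eqiad.wmnet", port=6379)  # placeholder

def with_retries(fn, attempts=5, delay=2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except (RedisConnectionError, BusyLoadingError):
            if attempt == attempts - 1:
                raise          # out of retries: surface the error
            time.sleep(delay)  # brief back-off, then try again

# e.g. enqueueing work survives a short blip of the queue server:
with_retries(lambda: client.lpush("render-jobs", "some-job-id"))
```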
[21:10:58] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:58] Was something updated recently, or perhpas "something may have gone wrong" [21:11:58] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 970 bytes in 0.061 second response time [21:12:17] should sort itself out in a moment [21:14:47] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Puppet has 3 failures [21:17:49] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1470 bytes in 0.325 second response time [21:17:58] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:19] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:19] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:27] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: puppet fail [21:18:57] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: puppet fail [21:20:08] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:27] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Puppet has 4 failures [21:22:25] (03PS1) 10Ori.livneh: Fix-up for Ibc46006e17: daemonize [puppet] - 10https://gerrit.wikimedia.org/r/258703 [21:22:28] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:38] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Ibc46006e17: daemonize [puppet] - 10https://gerrit.wikimedia.org/r/258703 (owner: 10Ori.livneh) [21:23:07] hashar: I was told using resource collectors on labs is a security risk... never checked if and to what extend that is true. [21:23:24] ori: we also have 2000 errors per seconds related to rdb1007 such as "Lua script error on server "rdb1007.eqiad.wmnet:6379": LOADING Redis is loading the dataset in memory" [21:23:46] yeah, that was when it was restarted. it should be ok now [21:23:46] jzerebecki: i think the issue is that the resources from your project are sent to the labs wide puppetmaster [21:23:58] jzerebecki: thus your project resources could be reached by other projects [21:25:10] ori: comes in spike every 1 minutes and 30 seconds [21:25:24] you're confusing https://docs.puppetlabs.com/puppet/latest/reference/lang_exported.html with resource collectors [21:25:32] on all redis instances of rdb1007 [21:25:43] oh [21:26:00] !log running fixDefaultJsonContentPages.php on all wikis (T108663) [21:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:10] they both have <<| |>> [21:26:49] no, one is <| |> [21:27:38] ah yes I confused those [21:27:59] *sigh* wikidata dispatching broke [21:28:24] maybe due to job runners redis? [21:28:53] yea [21:29:08] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:29:15] anyway. Gotta sleep [21:29:21] and recover from the week [21:29:33] have a good weekend [21:29:43] jzerebecki: I will get the Jenkins tmpfs issue sorted out eventually :( [21:30:10] jzerebecki: pretty sure it is the mwext-selenium-mw job being the cause because of SKIP_TMPFS. Will try to reproduce on Monday [21:30:16] have a good week-end! 
[21:30:18] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:30:23] hashar: maybe my patch would fix that [21:30:30] yeha [21:31:08] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures [21:31:57] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:18] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:25] jzerebecki: i will probably get rid of the skip tmpfs entirely . We will see [21:34:19] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: Puppet has 4 failures [21:36:28] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: Puppet has 1 failures [21:36:48] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:37:08] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:40:09] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:17] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 3 failures [21:40:27] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:39] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:40:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:41:57] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:43:38] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures [21:44:38] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: puppet fail [21:44:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[21:44:57] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [21:44:58] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures [21:45:37] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:45:38] RECOVERY - puppet last run on mc2012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:45:39] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:45:59] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:45:59] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:46:17] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:17] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet has 1 failures [21:46:18] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:29] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:58] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:47:58] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:47:59] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:48:09] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:48:58] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:49:37] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:59] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:49:59] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:50:57] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:55:31] (03PS1) 10Ori.livneh: Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 [21:55:56] (03CR) 10Ori.livneh: [C: 032] Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 (owner: 10Ori.livneh) [21:56:26] (03Merged) 10jenkins-bot: Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 (owner: 10Ori.livneh) [21:57:59] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:58:04] (03PS1) 10Aaron Schulz: Fixed duplicate rdb3 keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258710 [21:58:47] (03Abandoned) 10Ori.livneh: Fixed duplicate rdb3 keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258710 (owner: 10Aaron Schulz) [22:01:14] (03CR) 10Ori.livneh: "the 'ń' broke puppet on rutherfordium: Error: Could not convert change 'comment' to string: incompatible character encodings: UTF-8 and AS" [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 
10ArielGlenn) [22:03:48] (03PS1) 10Ori.livneh: ASCII-fi name in admin/data/data.tml [puppet] - 10https://gerrit.wikimedia.org/r/258713 [22:04:08] (03PS2) 10Ori.livneh: ASCII-fi name in admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/258713 [22:04:37] (03CR) 10Ori.livneh: [C: 032 V: 032] ASCII-fi name in admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/258713 (owner: 10Ori.livneh) [22:06:38] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:12:56] (03PS1) 10Ori.livneh: Use rdb1005 as the primary job queue aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258714 [22:31:37] PROBLEM - cassandra CQL 10.64.0.223:9042 on restbase1007 is CRITICAL: Connection refused [23:04:32] (03CR) 10Alex Monk: "Isn't the correct fix to make sure it expects utf-8 instead?" [puppet] - 10https://gerrit.wikimedia.org/r/258713 (owner: 10Ori.livneh) [23:28:58] (03PS1) 10Yuvipanda: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 [23:29:56] (03PS2) 10Ori.livneh: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:30:50] (03CR) 10Ori.livneh: [C: 04-1] "This will break, because init.pp includes some collectors" [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:35:58] (03PS3) 10Ori.livneh: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:38:32] ori: thanks :D [23:42:04] (03CR) 10Ori.livneh: [C: 031] "catalog compiler confirms it's a no-op on already-provisioned hosts (https://puppet-compiler.wmflabs.org/1485/)" [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:58:32] (03CR) 10Yuvipanda: [C: 032] diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda)
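Since both ends of this log touch Diamond — the openldap stats collector merged at 01:19, whose output is the servers.*.openldap.conns.current graphite series above, and the first-run fix merged here — a minimal, hypothetical sketch of the shape of a Diamond collector follows. It is not the actual collector from the puppet repo; the metric name and stub value are placeholders.

```python
# Hypothetical minimal Diamond collector: subclass Collector, implement
# collect(), and publish numeric values. A real openldap collector would query
# the server (e.g. cn=monitor) instead of returning a stub.
import diamond.collector


class ExampleOpenLDAPCollector(diamond.collector.Collector):

    def get_default_config(self):
        config = super(ExampleOpenLDAPCollector, self).get_default_config()
        config.update({
            'path': 'openldap',  # metrics land under servers.<host>.openldap.*
        })
        return config

    def collect(self):
        current_conns = self._current_connections()
        self.publish('conns.current', current_conns)

    def _current_connections(self):
        return 0  # stub value for the sketch
```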