[00:00:16] (PS2) Dzahn: hadoop: fix some lint issues [puppet] - https://gerrit.wikimedia.org/r/241315
[00:00:59] Does anyone out there have an iPhone to smoke test banners on prod? https://en.wikipedia.org/wiki/Main_Page?country=IL
[00:04:31] (PS1) BBlack: define conftool-data for cluster=phabricator,service=git-ssh [puppet] - https://gerrit.wikimedia.org/r/242014 (https://phabricator.wikimedia.org/T100519)
[00:05:14] (CR) Dzahn: [C: 2] "no diff on master and a client: http://puppet-compiler.wmflabs.org/926/" [puppet] - https://gerrit.wikimedia.org/r/241315 (owner: Dzahn)
[00:07:24] (PS1) Yuvipanda: ldap: Make it clearer where ldap.conf comes from [puppet] - https://gerrit.wikimedia.org/r/242015
[00:08:19] (CR) Dzahn: [C: -1] "please address qchris' comments" [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa)
[00:11:33] (CR) Dzahn: "@ArielGlenn, you said you kept this open for a specific reason. what was it?" [dns] - https://gerrit.wikimedia.org/r/120999 (owner: ArielGlenn)
[00:15:29] (PS1) Yuvipanda: ldap: Provide ldap credentials and servernames in YAML format [puppet] - https://gerrit.wikimedia.org/r/242017 (https://phabricator.wikimedia.org/T114063)
[00:17:35] (PS2) BBlack: define conftool-data for cluster=phabricator,service=git-ssh [puppet] - https://gerrit.wikimedia.org/r/242014 (https://phabricator.wikimedia.org/T100519)
[00:17:42] (CR) BBlack: [C: 2 V: 2] define conftool-data for cluster=phabricator,service=git-ssh [puppet] - https://gerrit.wikimedia.org/r/242014 (https://phabricator.wikimedia.org/T100519) (owner: BBlack)
[00:19:01] Krenair: is user_property indexed?
[00:20:03] legoktm, yes
[00:20:15] ok, then that query seems fine
[00:20:37] legoktm, look at "show indexes in user_properties" :)
[00:21:52] (PS2) Dzahn: switch 'stopsurveillance' redirect to policy site [puppet] - https://gerrit.wikimedia.org/r/241746 (https://phabricator.wikimedia.org/T97341)
[00:22:31] (CR) Dzahn: [C: 2] "per https://wikimediafoundation.org/wiki/User:TLi_%28WMF%29" [puppet] - https://gerrit.wikimedia.org/r/241746 (https://phabricator.wikimedia.org/T97341) (owner: Dzahn)
[00:29:00] operations, Wikimedia-Apache-configuration, Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1683693 (Dzahn) @tli hello, i uploaded a change to the cluster configuration to change this redirect as requested. the change should now gradually become active within...
[00:30:56] (PS1) BBlack: Fix 'service' value in LVS hieradata for git-ssh [puppet] - https://gerrit.wikimedia.org/r/242020 (https://phabricator.wikimedia.org/T100519)
[00:31:20] (CR) BBlack: [C: 2 V: 2] Fix 'service' value in LVS hieradata for git-ssh [puppet] - https://gerrit.wikimedia.org/r/242020 (https://phabricator.wikimedia.org/T100519) (owner: BBlack)
[00:33:03] RECOVERY - Confd template for /etc/pybal/pools/git-ssh on lvs1009 is OK: No errors detected
[00:35:33] RECOVERY - Confd template for /etc/pybal/pools/git-ssh on lvs1003 is OK: No errors detected
[00:35:53] RECOVERY - Confd template for /etc/pybal/pools/git-ssh on lvs1006 is OK: No errors detected
[00:38:33] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2008_v6
[00:39:22] operations, Analytics-EventLogging, Patch-For-Review: Create a package for python-pykafka for ubuntu precise and debian sid - https://phabricator.wikimedia.org/T109567#1683710 (Ottomata) FYI, python-pykafka doesn't work for precise because The following packages have unmet dependencies: python-pykafk...
[00:40:24] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[00:41:26] (PS2) Yuvipanda: ldap: Make it clearer where ldap.conf comes from [puppet] - https://gerrit.wikimedia.org/r/242015
[00:41:28] (PS2) Yuvipanda: ldap: Clean out some ensure => absented files [puppet] - https://gerrit.wikimedia.org/r/242013
[00:41:30] (PS2) Yuvipanda: ldap: mwclient no longer needed for ldap client [puppet] - https://gerrit.wikimedia.org/r/242012
[00:41:32] (PS2) Yuvipanda: ldap: Provide ldap credentials and servernames in YAML format [puppet] - https://gerrit.wikimedia.org/r/242017 (https://phabricator.wikimedia.org/T114063)
[00:41:45] (CR) Yuvipanda: [C: 2 V: 2] ldap: mwclient no longer needed for ldap client [puppet] - https://gerrit.wikimedia.org/r/242012 (owner: Yuvipanda)
[00:41:57] (CR) Yuvipanda: [C: 2 V: 2] ldap: Clean out some ensure => absented files [puppet] - https://gerrit.wikimedia.org/r/242013 (owner: Yuvipanda)
[00:42:09] (CR) Yuvipanda: [C: 2 V: 2] ldap: Make it clearer where ldap.conf comes from [puppet] - https://gerrit.wikimedia.org/r/242015 (owner: Yuvipanda)
[00:42:20] (CR) Yuvipanda: [C: 2 V: 2] ldap: Provide ldap credentials and servernames in YAML format [puppet] - https://gerrit.wikimedia.org/r/242017 (https://phabricator.wikimedia.org/T114063) (owner: Yuvipanda)
[00:43:06] (CR) Dzahn: move misc/labsdebrepo out of misc to module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[00:43:08] (CR) Tim Landscheidt: "I don't like the idea of having two files with the same information, one being standardized ldap.conf(5) and the other just for being able" [puppet] - https://gerrit.wikimedia.org/r/242017 (https://phabricator.wikimedia.org/T114063) (owner: Yuvipanda)
[00:43:38] (CR) Ottomata: [WIP] Consume EventLogging validation logs from Logstash (3 comments) [puppet] - https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: Mforns)
[00:45:33] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[00:46:06] (CR) Yuvipanda: "I don't like copy pasting code around, and I also like that scripts be standalone or depend only on packages / libraries, rather than file" [puppet] - https://gerrit.wikimedia.org/r/242017 (https://phabricator.wikimedia.org/T114063) (owner: Yuvipanda)
[00:48:38] (PS1) Yuvipanda: ldap: Fix string interpolation issues [puppet] - https://gerrit.wikimedia.org/r/242025
[00:49:13] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[00:49:26] (CR) Yuvipanda: [C: 2] ldap: Fix string interpolation issues [puppet] - https://gerrit.wikimedia.org/r/242025 (owner: Yuvipanda)
[00:51:14] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2014_v6
[00:52:04] (CR) Dzahn: move misc/labsdebrepo out of misc to module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[00:52:31] (PS13) Dzahn: move misc/labsdebrepo out of misc to module [puppet] - https://gerrit.wikimedia.org/r/194796
[00:53:03] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[00:56:42] !log tstarling@tin Synchronized php-1.26wmf24/extensions/ParsoidBatchAPI: Fix fatal error I77fd7e8 (duration: 00m 17s)
[00:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:57:15] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[00:59:05] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK
[01:04:35] (PS1) Yuvipanda: ldap: Make sure the correct ldap password is used [puppet] - https://gerrit.wikimedia.org/r/242026
[01:05:48] (CR) Yuvipanda: [C: 2] ldap: Make sure the correct ldap password is used [puppet] - https://gerrit.wikimedia.org/r/242026 (owner: Yuvipanda)
[01:06:48] (PS1) Dzahn: cdh: lint fixes - indentation [puppet/cdh] - https://gerrit.wikimedia.org/r/242031
[01:07:15] PROBLEM - DPKG on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:25] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:34] PROBLEM - Disk space on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:43] PROBLEM - salt-minion processes on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:53] PROBLEM - RAID on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:53] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:04] PROBLEM - configured eth on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:14] PROBLEM - dhclient process on pybal-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:09:33] PROBLEM - salt-minion processes on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:09:34] PROBLEM - configured eth on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:09:34] PROBLEM - Hadoop JournalNode on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:09:43] PROBLEM - DPKG on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:56] hrmm
[01:11:02] seems to be oom, cannot ssh, checking mgmt on analytics1035
[01:12:38] yea, i get root login via serial but putting in root and awaiting password it just sits...
[01:12:59] !log powercycling analytics1035, seems oom, cannot login via ssh or serial console
[01:13:04] RECOVERY - salt-minion processes on analytics1035 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:13:04] RECOVERY - Hadoop JournalNode on analytics1035 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode
[01:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:13:08] RECOVERY - configured eth on analytics1035 is OK: OK - interfaces up
[01:13:11] ...
[01:13:13] bah....
[01:13:13] RECOVERY - DPKG on analytics1035 is OK: All packages OK
[01:13:18] well, i hadnt done it yet but wtf..
[01:13:23] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[01:13:38] !log didnt powercycle analytics1035 yet, it recovered on its own.
[01:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:15:44] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2021_v6
[01:17:34] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[01:18:47] (PS1) Yuvipanda: ldap: Be sligltly less inaccurate about 'username' [puppet] - https://gerrit.wikimedia.org/r/242032
[01:19:49] (CR) Yuvipanda: [C: 2 V: 2] ldap: Be sligltly less inaccurate about 'username' [puppet] - https://gerrit.wikimedia.org/r/242032 (owner: Yuvipanda)
[01:22:04] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2005_v6
[01:23:14] PROBLEM - Check size of conntrack table on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:23:54] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[01:24:24] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:24:54] RECOVERY - Check size of conntrack table on analytics1035 is OK: OK: nf_conntrack is 0 % full
[01:27:05] (PS1) Catrope: Enable Flow opt-in on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/242035
[01:28:15] (CR) Catrope: [C: -2] "Hold until after wmf25 is deployed to mediawikiwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/242035 (owner: Catrope)
[01:29:53] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[01:32:24] (PS1) Jforrester: VisualEditor: Introduce wmgVisualEditorTransitionDefault and set true for Labs enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/242037
[01:34:49] (PS1) Yuvipanda: ldap: Rewrite ssh lookup script [puppet] - https://gerrit.wikimedia.org/r/242039 (https://phabricator.wikimedia.org/T114063)
[01:37:44] (PS1) Jforrester: VisualEditor: Set TransitionDefault true for the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/242040
[01:37:46] (PS1) Jforrester: VisualEditor: Switch to opt-out for English Wikipedia logged-in users only [mediawiki-config] - https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348)
[01:37:48] (PS1) Jforrester: VisualEditor: Enabled for logged-out users on the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/242042 (https://phabricator.wikimedia.org/T90662)
[01:38:18] (CR) Jforrester: [C: -2] "Not remotely ready, community-wise." [mediawiki-config] - https://gerrit.wikimedia.org/r/242042 (https://phabricator.wikimedia.org/T90662) (owner: Jforrester)
[01:38:43] (CR) Jforrester: [C: -1] "Only once T112352 has been completed and the community informed." [mediawiki-config] - https://gerrit.wikimedia.org/r/242041 (https://phabricator.wikimedia.org/T112348) (owner: Jforrester)
[01:38:55] (CR) Jforrester: [C: -1] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/242040 (owner: Jforrester)
[01:39:07] (CR) Yuvipanda: "Tested!" [puppet] - https://gerrit.wikimedia.org/r/242039 (https://phabricator.wikimedia.org/T114063) (owner: Yuvipanda)
[01:39:21] (CR) Catrope: [C: 2] VisualEditor: Introduce wmgVisualEditorTransitionDefault and set true for Labs enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/242037 (owner: Jforrester)
[01:39:27] (Merged) jenkins-bot: VisualEditor: Introduce wmgVisualEditorTransitionDefault and set true for Labs enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/242037 (owner: Jforrester)
[01:40:14] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[01:41:06] (CR) Jforrester: "It's not wmf25, it's it's -wmf.1" [mediawiki-config] - https://gerrit.wikimedia.org/r/242035 (owner: Catrope)
[01:41:23] (PS2) Yuvipanda: Tools: Install flex on bastions [puppet] - https://gerrit.wikimedia.org/r/241727 (https://phabricator.wikimedia.org/T114003) (owner: Tim Landscheidt)
[01:41:31] (CR) Yuvipanda: [C: 2 V: 2] Tools: Install flex on bastions [puppet] - https://gerrit.wikimedia.org/r/241727 (https://phabricator.wikimedia.org/T114003) (owner: Tim Landscheidt)
[01:41:42] (PS1) Catrope: Remove extraneous dollar sign [mediawiki-config] - https://gerrit.wikimedia.org/r/242043
[01:41:53] PROBLEM - LVS HTTP IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:41:55] (CR) Jforrester: [C: 2] Remove extraneous dollar sign [mediawiki-config] - https://gerrit.wikimedia.org/r/242043 (owner: Catrope)
[01:42:01] (Merged) jenkins-bot: Remove extraneous dollar sign [mediawiki-config] - https://gerrit.wikimedia.org/r/242043 (owner: Catrope)
[01:42:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK
[01:42:58] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add wmgVisualEditorTransitionDefault (false everywhere) (duration: 00m 17s)
[01:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:43:34] RECOVERY - LVS HTTP IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 440 bytes in 0.071 second response time
[01:43:37] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Plumbing for wmgVisualEditorTransitionDefault (duration: 00m 17s)
[01:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:43:58] so much flapping
[01:44:53] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:45:13] (PS1) Yuvipanda: tools: Remove ldapspportlib use from toolschecker [puppet] - https://gerrit.wikimedia.org/r/242044 (https://phabricator.wikimedia.org/T114063)
[01:45:19] andrewbogott: ^ for you :)
[01:46:43] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[01:47:23] PROBLEM - Check size of conntrack table on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:48:55] RECOVERY - Check size of conntrack table on analytics1035 is OK: OK: nf_conntrack is 0 % full
[01:52:53] PROBLEM - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:53:25] srsly
[01:54:35] RECOVERY - LVS HTTPS IPv6 on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 799 bytes in 0.167 second response time
[01:54:42] heh
[01:55:07] !log upload-lb.codfw.wikimedia.org_ipv6 page flap
[01:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:55:17] yuvipanda: lets start logging these so we can easily tell for our reports =]
[01:55:24] i just did for this one upload-lb.codfw.wikimedia.org_ipv6
[01:55:40] just we know it happens a lot, and we should show we are responding, seems to be a good way....
[01:55:57] (plus if folks hate it they'll tell us ;)
[01:55:58] robh: we can get that info from icinga logs too
[01:56:00] and email
[01:56:07] true, but that doesnt show we responded to shit
[01:56:19] heh
[01:56:33] but yea, we can get all the other info via email
[01:56:42] so no need really to log i suppose.
[01:57:03] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2008_v6
[01:57:04] indeed
[01:57:25] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[01:58:53] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[01:59:14] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK
[02:02:24] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[02:03:02] yuvipanda: unfortunately icinga history doesnt last long, it keeps forgetting
[02:03:14] mutante: not email!
[02:03:16] oh, email ,right
[02:03:17] email is forever
[02:03:21] you are right
[02:04:14] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2014_v6
[02:06:04] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK
[02:06:05] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:08:52] so yeah I guess codfw doesn't make the v6 service endpoint issues go away
[02:09:18] I do wonder why it flipped from primarily failing/alerting one to the other. they're both monitored, it's just which has more actual traffic...
[02:09:59] the other pieces of the myster is of course the ipv6 ipsec flaps, which coincide roughly on timing windows and kinda indicate v6 isn't great in general at that point
[02:10:05] *mystery
[02:10:17] (but could be mostly unrelated, too)
[02:13:23] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[02:13:25] did we ever solve the problem with icinga log rotation? that loses some of these events too
[02:15:13] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:17:54] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:21:07] (CR) Ricordisamoa: "I don't know what to do." [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa)
[02:23:15] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[02:24:15] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[02:24:16] !log ipv6 flap experiment: raise ipv6/route/max_size from 4096 to 131072 manually on cp20*
[02:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:00] !log ipv6 flap experiment: raise ipv6/route/max_size from 4096 to 131072 manually on cp*, actually
[02:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:56] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 06m 57s)
[02:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:53] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:32:05] Hello! Getting "AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99)." too many times on Phabricator today. What's up?
[02:32:24] jynus: ^ for you when you're back, maybe?
[02:34:55] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf24) at 2015-09-29 02:34:55+00:00
[02:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:24] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[02:39:23] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[02:59:27] I'm getting the same as Niharika.
[02:59:40] At https://phabricator.wikimedia.org/T88044
[02:59:48] Hmm, worked on refresh.
[03:02:16] It'll occur again.
[03:02:39] Depends on when your request hits the faults db (I think).
[03:02:53] Faulty*
[03:03:07] operations, Phabricator, Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1683851 (MZMcBride) I hit this `AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with...
[03:03:12] Niharika: https://phabricator.wikimedia.org/T109964#1683851
[03:04:15] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:25:02] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1683862 (BBlack) So, this evening the same flaps hit (as they do most evenings), but they hit codfw ipv6 service IPs rather than eqiad. Both DCs have been active and monitored al...
[03:26:23] PROBLEM - NTP on pybal-test2003 is CRITICAL: NTP CRITICAL: No response from NTP server
[03:27:44] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[03:32:23] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:33:34] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[03:40:54] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail
[03:55:33] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[04:01:05] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:03:44] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 20.69% of data above the critical threshold [100000000.0]
[04:05:23] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:06:43] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:12:54] (PS1) Dzahn: fix 'variable not enclosed' - pt1 [puppet] - https://gerrit.wikimedia.org/r/242055
[04:12:56] (PS1) Dzahn: ganeti: fix 'variable not enclosed' pt2 [puppet] - https://gerrit.wikimedia.org/r/242056
[04:12:58] (PS1) Dzahn: lint: fix 'variable not enclosed' pt4 [puppet] - https://gerrit.wikimedia.org/r/242057
[04:13:54] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[04:21:14] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:26:43] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[04:30:34] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[04:31:25] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:53:58] (PS2) Yuvipanda: Tools: Remove dependency of toollabs::checker on toollabs::submit [puppet] - https://gerrit.wikimedia.org/r/241581 (https://phabricator.wikimedia.org/T113744) (owner: Tim Landscheidt)
[04:54:10] (CR) Yuvipanda: [C: 2 V: 2] Tools: Remove dependency of toollabs::checker on toollabs::submit [puppet] - https://gerrit.wikimedia.org/r/241581 (https://phabricator.wikimedia.org/T113744) (owner: Tim Landscheidt)
[05:00:03] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[05:01:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 29 05:01:30 UTC 2015 (duration 1m 29s)
[05:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:05:35] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:10:54] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail
[05:11:05] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[05:12:54] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:20:13] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[05:27:34] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:36:53] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[05:38:33] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[05:40:34] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[05:50:42] operations, Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1683932 (awight) NEW a:Ottomata
[05:51:33] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[05:55:05] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:00:44] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[06:08:44] PROBLEM - puppet last run on mw2071 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:13:53] PROBLEM - puppet last run on mw2025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:22:44] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:28:14] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[06:29:34] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:30:04] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:30:23] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:54] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:38] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:33] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:35] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:25] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:44] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[06:38:20] operations, Phabricator, Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1683983 (jcrespo) Please reopen T109279, do not use this issue, as it is a duplicate of that one, and problems are centralized...
[06:40:46] operations, Phabricator, Database: phabricator dump script should use slave db, not master - https://phabricator.wikimedia.org/T112193#1683987 (jcrespo) p:Normal>High I do not have time to work on this, but using the master seems to be creating high contention, and could be the cause of new connec...
[06:43:04] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:43:14] operations, Phabricator, Database, Patch-For-Review, WorkType-Maintenance: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1683991 (Aklapper) Resolved>Open Reopening as per T109964#1683851
[06:44:45] (PS1) Jcrespo: Revert "phab: re-enable dump script" [puppet] - https://gerrit.wikimedia.org/r/242067
[06:45:07] (PS2) Jcrespo: Revert "phab: re-enable dump script" [puppet] - https://gerrit.wikimedia.org/r/242067
[06:45:39] (CR) Jcrespo: [C: 2] Revert "phab: re-enable dump script" [puppet] - https://gerrit.wikimedia.org/r/242067 (owner: Jcrespo)
[06:48:07] (PS1) Jcrespo: Revert "Revert "phab: disable tools crons"" [puppet] - https://gerrit.wikimedia.org/r/242069
[06:48:23] (PS2) Jcrespo: Revert "Revert "phab: disable tools crons"" [puppet] - https://gerrit.wikimedia.org/r/242069
[06:48:43] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[06:49:56] (PS3) Jcrespo: Revert "Revert "phab: disable tools crons"" [puppet] - https://gerrit.wikimedia.org/r/242069
[06:51:24] operations, Phabricator, Database, Patch-For-Review, WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1683997 (jcrespo)
[06:52:04] (CR) Jcrespo: [C: 2] Revert "Revert "phab: disable tools crons"" [puppet] - https://gerrit.wikimedia.org/r/242069 (owner: Jcrespo)
[06:55:25] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:55:45] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:58:14] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:58:24] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:58:48] !log checked out correct phabricator release tag on iridium. Something, somewhere, had reverted everything to an old deployment.
[06:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:59:07] chasemp ^ any ideas?
[06:59:43] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:02:24] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:02:56] 6operations, 6Phabricator, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1684018 (10jcrespo) I've reduced the t... [07:03:34] !log restarted apache2 and phd or iridium to get phabricator back into the correct state [07:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:23] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [07:05:27] (03PS5) 10Jcrespo: Populate labsdb1004 with mariadb [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) [07:06:34] (03CR) 10Jcrespo: [C: 032] Populate labsdb1004 with mariadb [puppet] - 10https://gerrit.wikimedia.org/r/218874 (https://phabricator.wikimedia.org/T88718) (owner: 10Jcrespo) [07:06:58] ^ yuvipanda [07:07:54] RECOVERY - puppet last run on mw2071 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:09:03] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [07:11:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [07:11:13] RECOVERY - puppet last run on mw2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [5000000.0] [07:14:24] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer [07:16:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold 
[1000000.0]
[07:16:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[07:17:03] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:29:14] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:31:04] (CR) QChris: "> I don't know what to do." [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa)
[07:36:03] PROBLEM - mysqld processes on labsdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[07:37:01] that is me, it is being setup, no worries
[07:37:34] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[07:37:35] jynus: ah, nice! thanks
[07:37:40] (I cannot ack if it wasn't on icinga before )
[07:37:52] yeah, no worries
[07:37:57] unless I continue refreshing all the time
[07:39:13] ACKNOWLEDGEMENT - mysqld processes on labsdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo setting up mysql for the first time
[07:39:17] ACKNOWLEDGEMENT - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo setting up mysql for the first time
[07:39:34] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:40:43] <_joe_> jynus: if you want to avoid this, schedule downtime as soon as the host is up
[07:40:48] <_joe_> I know, not always easy :)
[07:40:57] yeah, that was the catch
[07:41:19] but it was a new service, host was existent
[07:41:40] I could have downtimed the whole host, but I think that is worse
[07:44:09] (PS1) Giuseppe Lavagetto: elasticsearch: re-enter rack info for elastic1006 [puppet] - https://gerrit.wikimedia.org/r/242082 (https://phabricator.wikimedia.org/T112559)
[07:44:45] <_joe_> jynus: agreed
[07:45:04] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[07:47:59] ACKNOWLEDGEMENT - puppet last run on elastic1006 is
CRITICAL: CRITICAL: puppet fail Giuseppe Lavagetto will be fixed when 242082 gets merged
[07:48:53] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:54:24] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[07:56:34] (PS1) Jcrespo: Installing MariaDB 10 on tool's slave instead of 5.5 [puppet] - https://gerrit.wikimedia.org/r/242085
[07:57:22] (CR) Jcrespo: [C: 2] Installing MariaDB 10 on tool's slave instead of 5.5 [puppet] - https://gerrit.wikimedia.org/r/242085 (owner: Jcrespo)
[08:00:35] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:05:03] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:07:23] <_joe_> what's the actionable for this alarm ^^
[08:07:34] <_joe_> I know nothing about mailman 3
[08:18:04] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:18:17] (CR) Filippo Giunchedi: [C: 1] varnish: misspass limiter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/241643 (owner: BBlack)
[08:20:46] (CR) Filippo Giunchedi: [C: 1] ganeti: fix 'variable not enclosed' pt2 [puppet] - https://gerrit.wikimedia.org/r/242056 (owner: Dzahn)
[08:39:08] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:44:39] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[08:48:08] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:48:28] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:51:57] !log starting cloning of labsdb1005 (Tools DB), minimal disruption is expected
[08:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:09] Hi, who's managing the chapters wikis?
I've talked about it some time ago, but my logs got lost in corruption :S
[09:02:02] operations, Phabricator, Database, Patch-For-Review, WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1684166 (mmodell) I still don't have...
[09:08:48] PROBLEM - SSH on pybal-test2003 is CRITICAL: Server answer
[09:12:47] RECOVERY - DPKG on pybal-test2003 is OK: All packages OK
[09:12:47] RECOVERY - RAID on pybal-test2003 is OK: OK: no RAID installed
[09:12:58] RECOVERY - dhclient process on pybal-test2003 is OK: PROCS OK: 0 processes with command name dhclient
[09:13:08] RECOVERY - configured eth on pybal-test2003 is OK: OK - interfaces up
[09:13:30] !log powercycling pybal-test2003
[09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:41] "power"
[09:14:07] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:39] RECOVERY - salt-minion processes on pybal-test2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:15:48] RECOVERY - Disk space on pybal-test2003 is OK: DISK OK
[09:16:08] RECOVERY - SSH on pybal-test2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[09:16:18] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:35:59] RECOVERY - NTP on pybal-test2003 is OK: NTP OK: Offset -0.001456737518 secs
[09:42:13] (PS2) Giuseppe Lavagetto: dynamicproxy: add support for kubernetes WIP [puppet] - https://gerrit.wikimedia.org/r/241908
[09:42:57] (CR) jenkins-bot: [V: -1] dynamicproxy: add support for kubernetes WIP [puppet] - https://gerrit.wikimedia.org/r/241908 (owner: Giuseppe Lavagetto)
[09:49:47] (PS1) Siebrand: Rename Azerbaijani Wikisource project and
namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002)
[09:52:00] (PS1) Filippo Giunchedi: install_server: cassandra to /srv for 2 ssd hosts [puppet] - https://gerrit.wikimedia.org/r/242098 (https://phabricator.wikimedia.org/T113714)
[09:53:28] (PS3) Filippo Giunchedi: restbase: add LVS codfw configuration [puppet] - https://gerrit.wikimedia.org/r/240088 (https://phabricator.wikimedia.org/T108613)
[09:53:35] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: add LVS codfw configuration [puppet] - https://gerrit.wikimedia.org/r/240088 (https://phabricator.wikimedia.org/T108613) (owner: Filippo Giunchedi)
[10:07:47] operations, Pybal: conftool backend errors during merge - https://phabricator.wikimedia.org/T114091#1684335 (fgiunchedi) NEW
[10:08:19] operations, Services, Patch-For-Review, RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#1684343 (fgiunchedi) we're going to piggyback on multi-instance work for this too, plan is to start with restbase-test2* machines and start converting to multi inst...
[10:37:01] (PS12) Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253)
[10:37:09] (CR) Filippo Giunchedi: cassandra: WIP support for multiple instances (3 comments) [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: Filippo Giunchedi)
[10:39:03] PROBLEM - LVS HTTP IPv4 on restbase.svc.codfw.wmnet is CRITICAL: Connection refused
[10:39:36] sorry for the page, that's me
[10:41:42] ok
[10:49:36] !log bounce pybal on lvs2003 / lvs2006
[10:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:56:13] (PS1) Filippo Giunchedi: restbase: split realserver_ips in codfw/eqiad [puppet] - https://gerrit.wikimedia.org/r/242110
[10:56:20] and the root cause ^
[10:57:41] (CR) Alexandros Kosiaris: [C: 1] ganeti: fix 'variable not enclosed' pt2 [puppet] - https://gerrit.wikimedia.org/r/242056 (owner: Dzahn)
[10:57:48] (PS2) Filippo Giunchedi: restbase: split realserver_ips in codfw/eqiad [puppet] - https://gerrit.wikimedia.org/r/242110
[10:57:54] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: split realserver_ips in codfw/eqiad [puppet] - https://gerrit.wikimedia.org/r/242110 (owner: Filippo Giunchedi)
[11:02:21] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1684510 (BBlack) Update: log tail caught a few 1/3 soft fails on ipsec ipv6 still, but those could be due to legitimate timing and/or packet loss. The rate of them is much lower...
[11:04:14] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 426 MB (2% inode=99%)
[11:05:24] RECOVERY - LVS HTTP IPv4 on restbase.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.114 second response time
[11:06:05] RECOVERY - Disk space on mw1152 is OK: DISK OK
[11:10:35] (Abandoned) Hashar: parsoid: Remove parsoid beta role [puppet] - https://gerrit.wikimedia.org/r/193082 (https://phabricator.wikimedia.org/T86633) (owner: Yuvipanda)
[11:10:54] operations, Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1684521 (ArielGlenn) p:Low>Normal
[11:11:10] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1684524 (BBlack) See also https://bugzilla.redhat.com/show_bug.cgi?id=1221915 and the similar reports linked within, etc...
[11:11:34] operations, Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#605079 (ArielGlenn) Since we have the new array in place for some time now, let's revisit this and see how much more we can serve from WMF servers.
[11:12:17] operations, Traffic, Patch-For-Review, Pybal: Make pybal accept 30[12] for ProxyFetch - https://phabricator.wikimedia.org/T102393#1684526 (Aklapper) @ori, @bblack: Any plans to rework that last patch? Asking as this task is blocking T113151 which has priority "Unbreak now".
[11:29:48] (PS4) Mforns: [WIP] Consume EventLogging validation logs from Logstash [puppet] - https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627)
[11:33:49] (CR) Tim Landscheidt: "For the proxies we have the setup with one nginx/Redis master and one (nginx hot-spare/)Redis slave (cf.
$active_proxy), so that a failure" [puppet] - https://gerrit.wikimedia.org/r/241908 (owner: Giuseppe Lavagetto)
[11:37:04] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:38:33] (CR) Mforns: [WIP] Consume EventLogging validation logs from Logstash (3 comments) [puppet] - https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: Mforns)
[11:42:56] (CR) Mforns: [C: -1] "Still WIP" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: Mforns)
[11:57:19] (PS2) Rush: elasticsearch: re-enter rack info for elastic1006 [puppet] - https://gerrit.wikimedia.org/r/242082 (https://phabricator.wikimedia.org/T112559) (owner: Giuseppe Lavagetto)
[11:58:06] (PS1) Rush: (re)define git-ssh for LVS [dns] - https://gerrit.wikimedia.org/r/242112
[11:58:48] (CR) Rush: [C: 2] elasticsearch: re-enter rack info for elastic1006 [puppet] - https://gerrit.wikimedia.org/r/242082 (https://phabricator.wikimedia.org/T112559) (owner: Giuseppe Lavagetto)
[11:58:57] (CR) Rush: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/242082 (https://phabricator.wikimedia.org/T112559) (owner: Giuseppe Lavagetto)
[12:01:53] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[12:04:33] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[12:07:00] (CR) Alex Monk: Rename Azerbaijani Wikisource project and namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) (owner: Siebrand)
[12:13:05] operations, Phabricator, Database, Patch-For-Review, WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.
- https://phabricator.wikimedia.org/T109279#1684624 (chasemp) I'm going to chang...
[12:22:19] (CR) Alexandros Kosiaris: [C: -1] "Hopefully a last round of comments" (10 comments) [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[12:24:42] (PS2) Rush: (re)define git-ssh for LVS [dns] - https://gerrit.wikimedia.org/r/242112
[12:26:32] (CR) BBlack: [C: 1] "needs puppet changes for LVS/realserver in sep commit too:" [dns] - https://gerrit.wikimedia.org/r/242112 (owner: Rush)
[12:29:22] (PS1) Rush: (re)define ip for git-ssh.wm.o [puppet] - https://gerrit.wikimedia.org/r/242113
[12:30:31] (CR) Rush: [C: 2] (re)define git-ssh for LVS [dns] - https://gerrit.wikimedia.org/r/242112 (owner: Rush)
[12:31:37] (CR) Rush: [C: 2] (re)define ip for git-ssh.wm.o [puppet] - https://gerrit.wikimedia.org/r/242113 (owner: Rush)
[12:33:28] akosiaris, I suppose you are working with OTRS-slave, right?
[12:36:53] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[12:38:00] jynus: yup
[12:41:40] operations, MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1684659 (Joe) \o/ Resolving this, and reimaging the remaining videoscaler! @thedj I'll look into that specific bug today.
[12:46:04] operations, MediaWiki-General-or-Unknown, Wikimedia-Video: HHVM timeouts mean videoscaling can't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1684683 (Joe)
[12:46:07] operations, MediaWiki-extensions-TimedMediaHandler, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1684684 (Joe)
[12:46:10] operations, MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1684681 (Joe) Open>Resolved p:Triage>High
[12:47:44] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:47:48] operations, MediaWiki-General-or-Unknown, Wikimedia-Video: HHVM timeouts mean videoscaling can't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1684687 (Joe) Open>Resolved a:Joe
[12:50:27] operations, MediaWiki-extensions-TimedMediaHandler, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1684693 (Krenair)
[12:55:37] (PS1) Filippo Giunchedi: restbase-test2001 additional cassandra instances [dns] - https://gerrit.wikimedia.org/r/242117 (https://phabricator.wikimedia.org/T95253)
[12:57:16] (PS1) Rush: phab: limit system sshd to local address [puppet] - https://gerrit.wikimedia.org/r/242118
[12:58:35] (CR) Rush: [C: 2] phab: limit system sshd to local address [puppet] - https://gerrit.wikimedia.org/r/242118 (owner: Rush)
[12:58:55] Blocked-on-Operations, operations, Phabricator, Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1684724 (chasemp)
[13:03:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[13:06:05] operations, Pybal: pybal
doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (chasemp) NEW
[13:09:41] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1684759 (fgiunchedi)
[13:09:44] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: assess impact of many cassandra seed nodes with multi instance - https://phabricator.wikimedia.org/T113939#1684757 (fgiunchedi) Open>Resolved sounds like there's no immediate need to do for multi-instance, agreed on the puppet work (...
[13:11:07] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1684762 (faidon) >>! In T113154#1683862, @BBlack wrote: > Digging around a bit and thinking, I stumbled on a new theory: this might be because of the small fixed default size of `...
[13:12:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:15:15] operations, Analytics-Cluster, Patch-For-Review: php5-curl for stat1002 - https://phabricator.wikimedia.org/T113602#1684771 (Addshore) Open>Resolved
[13:16:09] (PS3) Giuseppe Lavagetto: dynamicproxy: add support for kubernetes WIP [puppet] - https://gerrit.wikimedia.org/r/241908
[13:16:59] (CR) jenkins-bot: [V: -1] dynamicproxy: add support for kubernetes WIP [puppet] - https://gerrit.wikimedia.org/r/241908 (owner: Giuseppe Lavagetto)
[13:17:43] (PS1) BBlack: Bump v6 route max_size to 131072 for all [puppet] - https://gerrit.wikimedia.org/r/242122 (https://phabricator.wikimedia.org/T113154)
[13:23:02] (PS2) Faidon Liambotis: Bump IPv6 route max_size to 131072 for all [puppet] - https://gerrit.wikimedia.org/r/242122 (https://phabricator.wikimedia.org/T113154) (owner: BBlack)
[13:23:05] (OCD)
[13:23:41] (CR) Faidon Liambotis: [C: 2] "My own
research on both the setting and the number of IPv6 hits supports this. I'm surprised it hasn't been a bigger problem so far, actua" [puppet] - https://gerrit.wikimedia.org/r/242122 (https://phabricator.wikimedia.org/T113154) (owner: BBlack)
[13:41:48] (PS10) Milimetric: Add Analytics Query Service role [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[13:41:53] (CR) Milimetric: Add Analytics Query Service role (10 comments) [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[13:42:32] akosiaris: thanks for the comments, addressed ^
[13:44:36] (PS1) Giuseppe Lavagetto: videoscaler: reimage and rename tmh1002 as mw1260, upgrade to HAT [puppet] - https://gerrit.wikimedia.org/r/242125 (https://phabricator.wikimedia.org/T104747)
[13:46:54] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[13:49:06] operations, Traffic, Patch-For-Review: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1684881 (faidon) FWIW, I did some additional searching and ended up...
in Facebook
[13:52:11] (PS1) Giuseppe Lavagetto: wmnet: rename tmh1002 => mw1260 [dns] - https://gerrit.wikimedia.org/r/242127
[13:53:57] operations, Traffic, Patch-For-Review: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1684894 (BBlack) ^ +1
[13:54:55] (PS1) Faidon Liambotis: Revert "Set eqiad's admin_state to "down"" [dns] - https://gerrit.wikimedia.org/r/242130
[13:54:55] (PS2) Faidon Liambotis: Revert "Set eqiad's admin_state to "down"" [dns] - https://gerrit.wikimedia.org/r/242130
[13:54:56] (PS3) Faidon Liambotis: Set eqiad's admin_state to "up" [dns] - https://gerrit.wikimedia.org/r/242130
[13:55:04] bblack: ^
[13:56:35] (CR) BBlack: [C: 1] Set eqiad's admin_state to "up" [dns] - https://gerrit.wikimedia.org/r/242130 (owner: Faidon Liambotis)
[13:57:38] (CR) Alexandros Kosiaris: [C: -1] "-1ed due a small error in fixing the previous commit (actually my fault on previous comment), otherwise LGTM." [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[13:57:57] !ban elastic1006 for T112559
[13:58:02] milimetric: sorry, my bad about a comment. ^ Otherwise LGTM
[13:58:06] k, fixing now
[13:58:13] "Due to high database server lag, changes newer than 76 seconds may not appear in this list."
[13:58:16] Hmm
[13:59:29] akosiaris: I think maybe you didn't submit your comment draft?
[13:59:37] (PS1) Giuseppe Lavagetto: wmnet: rename tmh1001 and 1002 mgmt interfaces [dns] - https://gerrit.wikimedia.org/r/242132
[14:00:01] (CR) Alexandros Kosiaris: Add Analytics Query Service role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[14:00:14] milimetric: I replied on PS9 as it seems
[14:00:54] no, posted the comments on PS9, and then replied on PS10...
[14:01:07] so they never got submitted and stayed as drafts
[14:01:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:01:45] (CR) Giuseppe Lavagetto: [C: 2] wmnet: rename tmh1002 => mw1260 [dns] - https://gerrit.wikimedia.org/r/242127 (owner: Giuseppe Lavagetto)
[14:02:12] (PS11) Milimetric: Add Analytics Query Service role [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[14:02:44] oops, that wasn't your fault, Alex, that was definitely me spacing out, sorry 'bout that
[14:03:39] (PS2) Giuseppe Lavagetto: videoscaler: reimage and rename tmh1002 as mw1260, upgrade to HAT [puppet] - https://gerrit.wikimedia.org/r/242125 (https://phabricator.wikimedia.org/T104747)
[14:06:23] (PS4) Faidon Liambotis: Set eqiad's admin_state to "up" [dns] - https://gerrit.wikimedia.org/r/242130
[14:06:27] (CR) Giuseppe Lavagetto: [C: 2] videoscaler: reimage and rename tmh1002 as mw1260, upgrade to HAT [puppet] - https://gerrit.wikimedia.org/r/242125 (https://phabricator.wikimedia.org/T104747) (owner: Giuseppe Lavagetto)
[14:06:29] (CR) Faidon Liambotis: [C: 2] Set eqiad's admin_state to "up" [dns] - https://gerrit.wikimedia.org/r/242130 (owner: Faidon Liambotis)
[14:07:22] come on jenkins
[14:07:31] (CR) Faidon Liambotis: [V: 2] Set eqiad's admin_state to "up" [dns] - https://gerrit.wikimedia.org/r/242130 (owner: Faidon Liambotis)
[14:08:09] !log repooling eqiad; 24h codfw test window is over
[14:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:10:31] (CR) Mobrovac: "Note: the LVS IP is still missing."
[puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[14:15:29] (PS1) Alexandros Kosiaris: Introducing aqs.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/242134
[14:16:30] milimetric: ^ that should be it
[14:30:37] (PS5) Andrew Bogott: Rsync: Unquote booleans [puppet] - https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783)
[14:30:39] (PS4) Andrew Bogott: interface: dequote booleans [puppet] - https://gerrit.wikimedia.org/r/241237 (https://phabricator.wikimedia.org/T113783)
[14:30:41] (PS4) Andrew Bogott: Cassandra: dequote some booleans. [puppet] - https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783)
[14:30:43] (PS4) Andrew Bogott: Grafana: dequote booleans. [puppet] - https://gerrit.wikimedia.org/r/241239 (https://phabricator.wikimedia.org/T113783)
[14:30:45] (PS4) Andrew Bogott: Gerrit role: dequote booleans [puppet] - https://gerrit.wikimedia.org/r/241240 (https://phabricator.wikimedia.org/T113783)
[14:30:47] (PS4) Andrew Bogott: Mark salt grain bool values with # lint:ignore:quoted_booleans [puppet] - https://gerrit.wikimedia.org/r/241241 (https://phabricator.wikimedia.org/T113783)
[14:30:49] (PS4) Andrew Bogott: webserver::php5 unquote a boolean. [puppet] - https://gerrit.wikimedia.org/r/241242 (https://phabricator.wikimedia.org/T113783)
[14:30:51] (PS4) Andrew Bogott: Webserver ca: disable the quoted-bool lint check [puppet] - https://gerrit.wikimedia.org/r/241243 (https://phabricator.wikimedia.org/T113783)
[14:30:53] (PS4) Andrew Bogott: Diamond: Turn off lint check for quoted bools. [puppet] - https://gerrit.wikimedia.org/r/241244 (https://phabricator.wikimedia.org/T113783)
[14:30:55] (PS4) Andrew Bogott: Disable quoted_boolean lint check around is_virtual refs.
[puppet] - https://gerrit.wikimedia.org/r/241245 (https://phabricator.wikimedia.org/T113783)
[14:34:50] (PS1) Muehlenhoff: Various bugfixes / tweaks [debs/debdeploy] - https://gerrit.wikimedia.org/r/242140
[14:38:07] (CR) Alexandros Kosiaris: "that would be" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[14:39:33] (CR) Alexandros Kosiaris: "that would https://gerrit.wikimedia.org/r/242134" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[14:44:13] (CR) Andrew Bogott: [C: 2] Rsync: Unquote booleans [puppet] - https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783) (owner: Andrew Bogott)
[14:46:06] (PS12) Milimetric: Add Analytics Query Service role [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056)
[14:49:29] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 339
[14:49:53] uhm?
[14:50:06] jynus?
[14:50:09] jynus: ^?
[14:50:16] never seen that page before
[14:50:33] and it shouldn't
[14:50:38] I was thinking teh same thing akosiaris
[14:50:38] it is not a production host
[14:52:23] I am deleting some old bugged preferences from s1 at the moment
[14:52:27] or was, actually it just moved on to another db
[14:52:43] It should be waiting for slaves, although I'm pretty sure db1047 isn't taken into account
[14:52:56] that's one of the analytics dbs isn't it?
[14:53:03] operations, RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1685098 (mobrovac)
[14:53:13] 1047 laggin there is ok, analytics use it a lot, and that is ok
[14:53:35] it should not page operations
[14:53:55] dbtree is showing lag 0 for that host
[14:54:03] nothing breaks if it lags
[14:54:11] I think there's a ticket about that somewhere, multi-source replication issue was it?
[14:54:52] yes, low priority, because again, it is not a production host and tendril has the right code for it
[14:54:59] https://tendril.wikimedia.org/host/view/db1047.eqiad.wmnet/3306
[14:55:42] I think this became critical due to this https://gerrit.wikimedia.org/r/#/c/241246/
[14:55:59] oh
[14:56:08] that is why it also paged everyone
[14:56:17] that's my assumption at this point
[14:56:17] today at labsdb1004
[14:56:32] * andrewbogott counts this as a win
[14:56:36] which is not set as critical, afaik
[14:56:39] akosiaris: is it because it should have always been doing it (technically) and now is correctly or because that caused some weirdness
[14:56:53] oh, maybe not
[14:57:20] analytics hosts should not wake up ops
[14:57:33] it should wake up me or otto
[14:57:37] but not everyone
[14:57:57] so… did my patch accidentally switch every service to critical?
[14:58:45] strings evaluate to true
[14:58:50] so "false" => true
[14:58:53] IIRC
[15:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150929T1500). Please do the needful.
[15:00:05] James_F Krenair: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:14] * James_F waves.
[15:00:27] yeah, that should be it
[15:00:37] andrewbogott: mind reverting that ?
[15:00:53] we should first track down the callers and amend them before merging it
[15:01:11] akosiaris: I did track down the callers
[15:01:17] where would that change have caused this, still not understanding
[15:01:20] Or, I tried to at least.
[15:01:27] James_F: I can SWAT.
[15:01:33] Cool, thanks.
[15:01:33] oh, you know… I missed one which got lumped into a later patch, let me find that
[15:02:03] Could it have been this?
https://gerrit.wikimedia.org/r/#/c/241235/5/manifests/role/analytics/kafka.pp
[15:02:36] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/242040 (owner: Jforrester)
[15:02:38] andrewbogott: did you also chase down all the submodules ?
[15:02:46] (Merged) jenkins-bot: VisualEditor: Set TransitionDefault true for the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/242040 (owner: Jforrester)
[15:02:48] like mariadb ?
[15:02:51] akosiaris: good point
[15:03:06] akosiaris: that’s probably it. Mind if I move forward rather than backwards?
[15:03:32] as long as you chase down all callers, no
[15:03:50] ‘k
[15:03:56] operations, Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1685110 (Nuria) Please note that varnish logging limits have been increased and that the l...
[15:04:41] PROBLEM - nutcracker port on mw1260 is CRITICAL: Timeout while attempting connection
[15:04:58] akosiaris: neither ‘true’ nor ‘false’ appears in the mariadb submodule
[15:05:01] PROBLEM - nutcracker process on mw1260 is CRITICAL: Timeout while attempting connection
[15:05:21] PROBLEM - puppet last run on mw1260 is CRITICAL: Timeout while attempting connection
[15:05:32] PROBLEM - salt-minion processes on mw1260 is CRITICAL: Timeout while attempting connection
[15:05:40] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: VisualEditor: Set TransitionDefault true for the English Wikipedia [[gerrit:242040]] (duration: 00m 17s)
[15:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:51] PROBLEM - Check size of conntrack table on mw1260 is CRITICAL: Timeout while attempting connection
[15:05:51] ^ James_F check please
[15:05:59] andrewbogott: yeah, I see that... something trigged that though.. need to find it...
[15:06:01] PROBLEM - DPKG on mw1260 is CRITICAL: Timeout while attempting connection
[15:06:11] PROBLEM - Disk space on mw1260 is CRITICAL: Timeout while attempting connection
[15:06:13] <_joe_> thcipriani: mw1260 is still being reimaged
[15:06:13] oh damn
[15:06:19] it was critical all along ?
[15:06:32] PROBLEM - RAID on mw1260 is CRITICAL: Connection refused by host
[15:06:40] so db1047 is critical?
[15:06:42] thcipriani: Yup, looks good to me.
[15:06:49] <_joe_> so if you're SWATTing, it's expected to time out
[15:06:56] jynus: I think it was all along
[15:07:01] PROBLEM - configured eth on mw1260 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:05] <_joe_> andrewbogott: need a review of your bool patches?
[15:07:11] PROBLEM - dhclient process on mw1260 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:07:16] _joe_: yup, just saw that, hostkey was refused, so all good. We can update later. Thanks for the heads up!
[15:07:18] _joe_: they 're all merged
[15:07:27] James_F: awesome. Thanks for checking
[15:07:32] akosiaris, _joe_, no they aren’t, only 2 of 10 are merged.
[15:07:38] then what changed is: 1) people reviewing changes all along; 2) someone, maybe icinga, dropping the "do not alert" on icinga
[15:07:40] tclark: Thank you.
[15:07:40] so, _joe_, yes please
[15:07:58] s/reviewing changes/receiving pages/
[15:08:20] _joe_: the thread starts here: https://gerrit.wikimedia.org/r/#/c/241237/
[15:08:20] will review the mariadb::analytics role
[15:08:22] andrewbogott: it's this https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:dequote3,n,z , right ?
[15:08:30] <_joe_> andrewbogott: thanks
[15:08:35] <_joe_> andrewbogott: use the compiler :)
[15:08:45] <_joe_> to see the differences on specific hosts
[15:08:46] _joe_: yep, did for the two I merged.
[15:08:51] akosiaris: yes [15:09:15] actually for so big changes, probably all hosts would be nice [15:09:29] <_joe_> thcipriani: it will update itself as it will run sync-common once it's done [15:09:43] <_joe_> akosiaris: for big changes, yes, are these big changes? [15:10:05] all hosts? Is there an option for that in the compiler? [15:10:36] <_joe_> andrewbogott: kind of, yes [15:10:41] <_joe_> andrewbogott: btw, I see you did [15:10:49] <_joe_> if $somevar == true [15:10:50] jynus: it's this line https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/role/mariadb.pp;49f949e85dfe6bc330810f07552e9c81619dc24a$348 [15:10:55] <_joe_> please do if $somevar [15:11:09] <_joe_> it's cleaner and would catch even casts [15:11:23] <_joe_> but maybe that is just my taste [15:11:26] akosiaris, that indeed can be fine-tuned [15:11:36] _joe_: Yeah, I agree… I was thinking that if the patch ONLY stripped quotes that it would be easier to review [15:11:53] But in general, “== true” looks stupid to me too [15:12:37] jynus: actually it can not...see this https://phabricator.wikimedia.org/diffusion/OPMD/browse/master/manifests/monitor_replication.pp [15:12:53] so, the call in the role should anyway see the is_critical parameter [15:13:01] but it is not used anyway [15:13:07] ha ha [15:13:31] so we always passed true, which internally was not "true" and hence false [15:13:58] when andrewbogott changed the call, suddently true was true and the state of the check changed [15:14:25] yeah, there are lots of places where... 
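Editor's sketch of the is_critical behaviour described above, written in Ruby as an analogy and not as the real Puppet code. The helper name `pages?` is hypothetical; it stands in for any check that pages only when its argument is the boolean `true`. Under that assumption, a quoted 'true' is a string and never equals the boolean, which would explain why the check only started paging once the quotes were removed.

```ruby
# Hypothetical stand-in for a check that pages only when is_critical is
# the boolean true. This is an assumption for illustration, not the real
# monitor_replication code.
def pages?(is_critical)
  is_critical == true
end

puts pages?('true') # quoted value: a String, so the comparison fails
puts pages?(true)   # bare boolean: the comparison succeeds
```

A bare truthiness test (`if is_critical`) would accept both values in Ruby, which is why the two spellings are not interchangeable when reviewing a dequoting patch.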
[15:14:30] someone tried to enable paging but failed [15:14:42] So there will be some new pages, but in theory those are all ‘on purpose' [15:14:48] ok, so, I can fix the class, but we may want to discuss what is critical and what is not [15:15:21] RECOVERY - Disk space on mw1260 is OK: DISK OK [15:15:32] RECOVERY - nutcracker port on mw1260 is OK: TCP OK - 0.000 second response time on port 11212 [15:15:51] RECOVERY - RAID on mw1260 is OK: OK [15:15:52] RECOVERY - nutcracker process on mw1260 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [15:16:05] or maybe create more specific contact groups [15:16:12] RECOVERY - configured eth on mw1260 is OK: OK - interfaces up [15:16:22] RECOVERY - dhclient process on mw1260 is OK: PROCS OK: 0 processes with command name dhclient [15:16:32] RECOVERY - salt-minion processes on mw1260 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:16:43] RECOVERY - Check size of conntrack table on mw1260 is OK: OK: nf_conntrack is 0 % full [15:16:52] RECOVERY - DPKG on mw1260 is OK: All packages OK [15:18:14] but that was already only for dbas [15:19:43] !log ban elastic1031 for T112559 [15:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:40] (03PS1) 10RobH: putting andrew into sms paging group [puppet] - 10https://gerrit.wikimedia.org/r/242154 [15:26:29] robh: temptation to -1 per OCD :p [15:26:36] (03CR) 10RobH: [C: 032] putting andrew into sms paging group [puppet] - 10https://gerrit.wikimedia.org/r/242154 (owner: 10RobH) [15:28:10] (03PS2) 10RobH: putting andrew into sms paging group [puppet] - 10https://gerrit.wikimedia.org/r/242154 [15:28:24] 10Ops-Access-Requests, 6operations: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1685215 (10csteipp) [15:29:12] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: puppet fail [15:32:21] _joe_: tell me more about how to 
run a puppet compiler job for all hosts? [15:32:31] (03PS1) 10John F. Lewis: admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) [15:33:14] JohnFLewis: you mean in that its not all alphabetical? i know! [15:33:26] robh: {{sofixit}} :) [15:33:53] {{laterrrrrrrrrrrrrrr}} [15:34:02] like my admin alphabetization [15:34:11] i did get to it a month later and it was nice and clean for months! [15:34:13] robh: the order doesn't annoy me. the fact its in order and not in order at the same time. like what the fuck [15:34:31] we need to caution the opsen who merged the first patch that killed its perfect-ness! [15:35:29] (03CR) 10BryanDavis: [WIP] Consume EventLogging validation logs from Logstash (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [15:37:20] andrewbogott: I was looking at https://gerrit.wikimedia.org/r/#/c/241235/5/modules/openstack/manifests/nova/compute.pp,unified [15:37:31] curious as to why change 'no' to no [15:37:45] it's not a boolean, no ? [15:37:52] https://github.com/wikimedia/operations-puppet/commit/9af5de52afdfdbb4b3e3e14c8a0fa82fe5411613 the first commit which killed order :( [15:38:05] akosiaris: I think it is — but I will double-check [15:39:02] !log cp1065: live-testing some VCL patches, puppet disabled, etc... [15:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:29] andrewbogott: The boolean data type has two possible values: true and false. Literal booleans must be one of these two bare words (that is, not quoted). 
[15:39:34] https://docs.puppetlabs.com/puppet/latest/reference/lang_data_boolean.html [15:39:36] "no" would be true but it 'no' is not a boolean as false [15:40:40] chasemp: I… can’t tell what you’re saying [15:40:54] I meant something there :) yes/no are not boolean equivs [15:41:04] even tho technically "no" or "yes" would eval to true [15:41:07] ok, then that patch is probably stupid. I’ll revert [15:41:21] yeah, I was about to suggest that. thanks [15:41:42] not about the stupid part, about the revert part btw [15:42:09] (03PS1) 10Andrew Bogott: Revert "Rsync: Unquote booleans" [puppet] - 10https://gerrit.wikimedia.org/r/242166 [15:42:15] :) [15:45:54] (03PS2) 10Andrew Bogott: Revert "Rsync: Unquote booleans" [puppet] - 10https://gerrit.wikimedia.org/r/242166 [15:46:45] "" eval'ing to true is a ruby thing I guess [15:47:08] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1685317 (10RobH) So @ottomata linked me to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups. Plus @Krenair (correctly) suggests (on T113680 that researchers is the n... [15:47:13] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1673053 (10RobH) So @ottomata linked me to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups. Plus @Krenair (correctly) suggests (on T113680 that researchers is... [15:47:19] (03CR) 10Andrew Bogott: [C: 032] Revert "Rsync: Unquote booleans" [puppet] - 10https://gerrit.wikimedia.org/r/242166 (owner: 10Andrew Bogott) [15:52:33] chasemp: yep.
only `false` and `nil` are false in ruby [15:52:52] !log ending VCL tests on cp1065, starting on cp1053 instead [15:52:53] an empty string is still a String object, no it's truthy [15:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:01] *so* it's truthy [15:54:19] it's not crazy I suppose, I've heard perf reasonings which seem a little overbaked but "if you want to check for empty string then do so" makes sense to me [15:55:45] (03PS1) 10RobH: adding user etonkovidova [puppet] - 10https://gerrit.wikimedia.org/r/242168 [15:56:30] (03CR) 10RobH: [C: 032] "This is a new patchset, but the 3 day wait on the referenced task has passed without issue." [puppet] - 10https://gerrit.wikimedia.org/r/242168 (owner: 10RobH) [15:58:41] !log ending VCL tests on cp1053 [15:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:51] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:00:05] godog jynus: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150929T1600). Please do the needful. [16:00:05] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:22] hey [16:00:41] hey [16:00:53] So I don't think this module I'm changing is actually used in prod. [16:01:13] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1685376 (10RobH) 5Open>3Resolved Ok, Confirmed @Elena signed L3 & no one has raised any objections to the proposed access. As the access isn't sudo, but a typical level, and the 3... [16:01:27] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1685378 (10RobH) a:5SBisson>3RobH [16:02:13] Krenair, is it for beta? 
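Editor's sketch of the Ruby truthiness rule mentioned above (only `false` and `nil` are falsy). The helper `truthy?` is a hypothetical name introduced for this example. It covers plain Ruby only and makes no claim about how any particular Puppet version evaluates these values.

```ruby
# Hypothetical helper: reports whether Ruby treats a value as true in a
# conditional. Only false and nil are falsy in Ruby.
def truthy?(value)
  value ? true : false
end

# The empty string, the strings "no" and "false", and 0 are all truthy.
[false, nil, '', 'no', 'false', 0].each do |v|
  puts "#{v.inspect} -> #{truthy?(v)}"
end

# Checking for an empty string has to be explicit.
puts ''.empty?
```

This matches the point made in the channel: code that wants to treat an empty string as "unset" needs to test for emptiness directly.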
[16:02:21] it's for shinken in labs [16:05:02] yes, cannot find it anywhere outside of tool/labs [16:05:05] (03PS5) 10Andrew Bogott: interface: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241237 (https://phabricator.wikimedia.org/T113783) [16:05:07] (03PS5) 10Andrew Bogott: Cassandra: dequote some booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) [16:05:09] (03PS5) 10Andrew Bogott: Grafana: dequote booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241239 (https://phabricator.wikimedia.org/T113783) [16:05:12] (03PS5) 10Andrew Bogott: Gerrit role: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241240 (https://phabricator.wikimedia.org/T113783) [16:05:13] (03PS5) 10Andrew Bogott: Mark salt grain bool values with # lint:ignore:quoted_booleans [puppet] - 10https://gerrit.wikimedia.org/r/241241 (https://phabricator.wikimedia.org/T113783) [16:05:15] (03PS5) 10Andrew Bogott: webserver::php5 unquote a boolean. [puppet] - 10https://gerrit.wikimedia.org/r/241242 (https://phabricator.wikimedia.org/T113783) [16:05:17] (03PS5) 10Andrew Bogott: Webserver ca: disable the quoted-bool lint check [puppet] - 10https://gerrit.wikimedia.org/r/241243 (https://phabricator.wikimedia.org/T113783) [16:05:19] (03PS5) 10Andrew Bogott: Diamond: Turn off lint check for quoted bools. [puppet] - 10https://gerrit.wikimedia.org/r/241244 (https://phabricator.wikimedia.org/T113783) [16:05:22] (03PS5) 10Andrew Bogott: Disable quoted_boolean lint check around is_virtual refs. 
[puppet] - 10https://gerrit.wikimedia.org/r/241245 (https://phabricator.wikimedia.org/T113783) [16:05:23] (03PS1) 10Andrew Bogott: Dequote one more nrpe critical setting [puppet] - 10https://gerrit.wikimedia.org/r/242170 (https://phabricator.wikimedia.org/T113783) [16:05:25] (03PS1) 10Andrew Bogott: dataset: Remove needless quotes around a 'true' [puppet] - 10https://gerrit.wikimedia.org/r/242171 (https://phabricator.wikimedia.org/T113783) [16:05:27] (03PS1) 10Andrew Bogott: Change a few rsync params from true/false to yes/no [puppet] - 10https://gerrit.wikimedia.org/r/242172 [16:05:32] (03PS1) 10RobH: adding shell user sbisson [puppet] - 10https://gerrit.wikimedia.org/r/242173 [16:06:15] (03CR) 10RobH: [C: 032] "3 day wait on linked task has passed without objection" [puppet] - 10https://gerrit.wikimedia.org/r/242173 (owner: 10RobH) [16:07:08] (03CR) 10Dpatrick: [C: 031] admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) (owner: 10John F. Lewis) [16:11:07] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for sbisson - https://phabricator.wikimedia.org/T113676#1685439 (10RobH) 5Open>3Resolved @sbisson signed the L3 and there have been no objections raised, so access has been merged. I've watched it go live on bast1001.wikimedia.org & stat100... [16:12:16] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1685454 (10RobH) 5Open>3Resolved So we switched to AQL and off smsglobal yesterday. I'm still in evaluations for alternative vendors, but this particular task is now done. 
[16:12:31] (03CR) 10Jcrespo: [C: 031] shinken: Fix inclusion of labs proxyagent password in shinkengen [puppet] - 10https://gerrit.wikimedia.org/r/241526 (owner: 10Alex Monk) [16:12:42] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1685461 (10RobH) [16:13:15] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1472363 (10RobH) [16:13:16] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1685464 (10RobH) 5stalled>3declined No more investigation, just migrated away from them as a vendor instead. [16:13:16] (03PS1) 10John F. Lewis: wikistats: remove Orain [debs/wikistats] - 10https://gerrit.wikimedia.org/r/242176 [16:13:19] godog, anything against it? [16:13:51] it should not work at all as it is now [16:14:04] !log reverted configuration hotfix from yesterday's Parsoid deploy (re-enabled use of Parsoid batching API) [16:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:41] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1685467 (10RobH) [16:17:48] (03PS6) 10Giuseppe Lavagetto: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [16:22:41] (03PS3) 10Jcrespo: shinken: Fix inclusion of labs proxyagent password in shinkengen [puppet] - 10https://gerrit.wikimedia.org/r/241526 (owner: 10Alex Monk) [16:24:23] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-116: New server: labservices1001 - https://phabricator.wikimedia.org/T106147#1685502 (10RobH) [16:24:39] 6operations, 6Labs, 10Labs-Infrastructure: rename holmium to labdns1002 - https://phabricator.wikimedia.org/T106303#1685508 (10RobH) [16:24:43] 6operations, 6Labs, 10Labs-Infrastructure, 
10hardware-requests, 3labs-sprint-116: New server: labservices1001 - https://phabricator.wikimedia.org/T106147#1460393 (10RobH) 5Open>3Resolved a:3RobH labservices1001 has been allocated and setup via subtasks [16:25:12] 6operations, 6Labs, 10Labs-Infrastructure: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1685519 (10RobH) [16:25:26] (03CR) 10Jcrespo: [C: 032] shinken: Fix inclusion of labs proxyagent password in shinkengen [puppet] - 10https://gerrit.wikimedia.org/r/241526 (owner: 10Alex Monk) [16:25:59] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1472236 (10RobH) [16:26:45] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1685531 (10Joe) [16:26:47] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1685532 (10Joe) [16:26:49] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 7HHVM, 5Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1685529 (10Joe) 5Open>3Resolved [16:27:00] <_joe_> FINALLY [16:27:11] Krenair, merged, can you test it? 
[16:28:22] (03PS2) 10Giuseppe Lavagetto: wmnet: rename tmh1001 and 1002 mgmt interfaces [dns] - 10https://gerrit.wikimedia.org/r/242132 [16:28:46] I already made and tested it on shinken-ircbot-testing.shinken.eqiad.wmflabs [16:28:59] ah, ok :-) [16:29:40] (03CR) 10Giuseppe Lavagetto: [C: 032] wmnet: rename tmh1001 and 1002 mgmt interfaces [dns] - 10https://gerrit.wikimedia.org/r/242132 (owner: 10Giuseppe Lavagetto) [16:30:55] !log swapping failed disk db1050 [16:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:06] jynus, okay, I've run puppet on shinken-01, it definitely results in the correct password being written [16:31:11] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for etonkovidova - https://phabricator.wikimedia.org/T113680#1685555 (10RobH) [16:31:25] thanks [16:31:39] so that is done for today, i think [16:31:56] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [16:32:12] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1685561 (10RobH) [16:32:25] (03PS1) 10Muehlenhoff: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/242180 [16:33:02] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1685566 (10RobH) a:5RobH>3Andrew I've done everything up to the signing puppet and salt keys for the initial puppet run(s). Assigning this task from myself to @andrew for service implemen... [16:35:11] 6operations, 10ops-eqiad: db1050 raid degraded - https://phabricator.wikimedia.org/T103110#1685579 (10Cmjohnson) We killed of es1005 and re-purposed one of the disks for db1050. The disk is in rebuild state Enclosure Device ID: 32 Slot Number: 5 Drive's position: DiskGroup: 0, Span: 2, Arm: 1 Enclosure posi... [16:36:13] ^yay! [16:36:51] what is es100x.eqiad.wmnet? 
Tons of unsigned salt keys for those on palladium [16:37:04] oh, nm, I see that you’re working on that right now :) [16:37:27] external storage databases? [16:37:31] maybe we can do the same for db1051? it is not an emergency, but it is blocking some pending work for enwiki [16:37:56] andrewbogott, those are docommed servers, that were on when I revoked the servers [16:38:17] have to give it a second pass to delete the keys again [16:39:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Various bugfixes / tweaks [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/242140 (owner: 10Muehlenhoff) [16:40:21] hey [16:40:55] 6operations, 6Phabricator, 6Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1685603 (10RobH) 3NEW a:3RobH [16:42:10] _joe_: congratulations on the tmh conversion. :) [16:42:40] <_joe_> bd808: well the merit is mostly brion's :) [16:42:55] (03PS1) 10Alexandros Kosiaris: Add the new OTRS scheduler watchdog cron entry [puppet] - 10https://gerrit.wikimedia.org/r/242184 [16:43:00] <_joe_> bd808: I have also to deploy a package with all your patches [16:43:06] 6operations, 6Phabricator, 6Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1685615 (10RobH) I'll note for future changing of the #operations group, we need to audit ALL tasks and ensure the policy... [16:43:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1685623 (10csteipp) Can @dpatrick also get access to statistics-privatedata-users too? Sorry I missed that the first time! [16:43:45] I was kind of excited when I figured out how to do some of the work for you there. [16:44:06] I'm not sure why it took me a year to figure that out [16:45:29] (03PS2) 10John F. 
Lewis: admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) [16:45:39] (03PS3) 10John F. Lewis: admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) [16:45:53] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1685631 (10JohnLewis) Added to patch. [16:49:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1685645 (10RobH) a:3RobH I've chatted with @csteipp about this already via IRC. Since this is a typical access request, we need to still wait the 3... [16:53:10] (03PS1) 10Andrew Bogott: Add labservices1001 as a holmium spare. [puppet] - 10https://gerrit.wikimedia.org/r/242187 (https://phabricator.wikimedia.org/T106142) [16:53:12] (03PS1) 10Andrew Bogott: Openstack/Designate: Add service monitoring on active designate host [puppet] - 10https://gerrit.wikimedia.org/r/242188 [16:53:16] 6operations, 7Pybal: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1685676 (10faidon) > This may be too much of an edge case to worry about seriously. It has happened before and has actually caused site issues in the past, so definitely... [16:57:00] (03PS2) 10Andrew Bogott: Add labservices1001 as a holmium spare. [puppet] - 10https://gerrit.wikimedia.org/r/242187 (https://phabricator.wikimedia.org/T106142) [16:57:48] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:03:19] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 100 % full [17:05:48] on iron? 
[17:07:00] RECOVERY - Check size of conntrack table on iron is OK: OK: nf_conntrack is 72 % full [17:07:14] there is almost no connections on iron, strange [17:07:30] (03PS1) 10Alexandros Kosiaris: otrs: Update apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/242192 [17:10:51] jynus: which code review sorry? "anything against it?" [17:11:38] lol, godog you are late for puppet swat :-) [17:12:20] I will let you handle the Thursday one as a punishment :-P [17:12:59] jynus: doh! what a fail, I'll take Thurs for sure! [17:13:06] jynus: sorry about that [17:13:32] (03CR) 10Andrew Bogott: [C: 032] Add labservices1001 as a holmium spare. [puppet] - 10https://gerrit.wikimedia.org/r/242187 (https://phabricator.wikimedia.org/T106142) (owner: 10Andrew Bogott) [17:13:50] np, 1 patch, I could handle that without problem. I almost forget too, but IRC bot pinged me [17:16:30] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 100 % full [17:16:32] 6operations, 10OTRS: Apply security patch to OTRS (Scheduler Process ID File Access vulnerability) - https://phabricator.wikimedia.org/T114132#1685761 (10Jgreen) p:5Triage>3High [17:16:59] (03CR) 10ArielGlenn: "I'm doing a pile of pylints (they'll show up in gerrit over the next day) on the active code, which is in the ariel branch. Once I'm done" [dumps] - 10https://gerrit.wikimedia.org/r/207504 (owner: 10Dereckson) [17:17:00] oh, now I can see it [17:19:15] 6operations, 10ops-eqiad: db1051 degraded raid (disk) - https://phabricator.wikimedia.org/T113786#1685771 (10Cmjohnson) Disk ordered Congratulations: Work Order SR917845784 was successfully submitted. [17:19:45] oh, it was only 49 and 50 that had that problem? 
[17:19:49] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 11 failures [17:20:19] RECOVERY - Check size of conntrack table on iron is OK: OK: nf_conntrack is 34 % full [17:23:19] 6operations, 10ops-eqiad: Replace failed PSU2 on mw1200 - https://phabricator.wikimedia.org/T114142#1685778 (10Cmjohnson) 3NEW a:3Cmjohnson [17:23:35] 6operations, 10ops-eqiad: Replace failed PSU2 on mw1200 - https://phabricator.wikimedia.org/T114142#1685787 (10Cmjohnson) Requested part from Dell Congratulations: Work Order SR917845971 was successfully submitted. [17:33:44] (03PS7) 10Ottomata: Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) [17:35:13] test [17:35:29] 6operations, 7Pybal: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1685873 (10mark) It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirable behaviour back in 2006 when I wrote it - when we st... [17:35:42] (03CR) 10Ottomata: [C: 032] Puppetize etcd use for eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/240916 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [17:36:09] Is labs broken or is it just me? [17:37:12] sjoerddebruin: I could log into one of my VMs just now [17:37:31] Hm, weird. It's working again [17:37:32] I was just logged in, so... 
[17:39:38] (03PS1) 10Ottomata: Set default for $ssldir in etcd::ssl::base [puppet] - 10https://gerrit.wikimedia.org/r/242201 (https://phabricator.wikimedia.org/T112688) [17:40:34] (03CR) 10Ottomata: [C: 032] Set default for $ssldir in etcd::ssl::base [puppet] - 10https://gerrit.wikimedia.org/r/242201 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [17:40:41] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:45:20] ^there is a host defined twice [17:45:25] (03PS2) 10Andrew Bogott: Openstack/Designate: Add service monitoring on active designate host [puppet] - 10https://gerrit.wikimedia.org/r/242188 [17:45:27] (03PS1) 10Andrew Bogott: Openstack: Remove service defs for services we do not want [puppet] - 10https://gerrit.wikimedia.org/r/242202 [17:46:02] someone working with labs-recursor0? [17:46:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1685946 (10RobH) p:5Triage>3Normal [17:46:27] (03CR) 10Andrew Bogott: [C: 032] Openstack/Designate: Add service monitoring on active designate host [puppet] - 10https://gerrit.wikimedia.org/r/242188 (owner: 10Andrew Bogott) [17:46:49] (03PS1) 10Ottomata: Remove etcd package requirement from etcd::ssl::base [puppet] - 10https://gerrit.wikimedia.org/r/242203 (https://phabricator.wikimedia.org/T112688) [17:46:55] !log size of conntrack table on iron might be increased due to test scans [17:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:21] (03PS1) 10Alexandros Kosiaris: otrs: add missing perl modules for OTRS 4.0.13 [puppet] - 10https://gerrit.wikimedia.org/r/242205 [17:47:22] andrewbogott: ^ what jynus wrote about labs-recursor0 [17:47:35] neon is complaining about icinga config errors [17:48:59] akosiaris, jynus, that’s me, I’ll look as soon as I’m able to merge my monitoring patch... 
[17:51:28] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686017 (10RobH) @ZhouZ, Sorry to request more info from you, but we need a few more things for this to be accomplished. Your access request is to be added to the analytics-privatedata-... [17:52:04] (03PS2) 10Ottomata: Remove etcd package requirement from etcd::ssl::base [puppet] - 10https://gerrit.wikimedia.org/r/242203 (https://phabricator.wikimedia.org/T112688) [17:52:10] (03CR) 10Ottomata: [C: 032 V: 032] Remove etcd package requirement from etcd::ssl::base [puppet] - 10https://gerrit.wikimedia.org/r/242203 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [17:54:08] 10Ops-Access-Requests, 6operations: add spage to analytics-privatedata-users group for hive access - https://phabricator.wikimedia.org/T114150#1686048 (10Spage) 3NEW [17:55:01] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686068 (10ZhouZ) Sure, what's the best way for him (Stephen Laporte) to approve? [17:56:39] RECOVERY - Disk space on labstore1002 is OK: DISK OK [17:59:00] (03PS1) 10Ottomata: Ensure /var/lib/etcd exists even if etcd package is not installed [puppet] - 10https://gerrit.wikimedia.org/r/242207 (https://phabricator.wikimedia.org/T112688) [17:59:10] (03PS2) 10Ottomata: Ensure /var/lib/etcd exists even if etcd package is not installed [puppet] - 10https://gerrit.wikimedia.org/r/242207 (https://phabricator.wikimedia.org/T112688) [17:59:48] 6operations, 10ops-eqiad: Replace failed PSU on wtp1021 - https://phabricator.wikimedia.org/T114151#1686102 (10Cmjohnson) 3NEW a:3Cmjohnson [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150929T1800). Please do the needful. 
[18:00:04] (03PS3) 10Ottomata: Ensure /var/lib/etcd exists even if etcd package is not installed [puppet] - 10https://gerrit.wikimedia.org/r/242207 (https://phabricator.wikimedia.org/T112688) [18:01:33] (03CR) 10Ottomata: [C: 032] Ensure /var/lib/etcd exists even if etcd package is not installed [puppet] - 10https://gerrit.wikimedia.org/r/242207 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [18:02:25] (03PS1) 10Andrew Bogott: Set up DNS for alternate labs DNS servers. [dns] - 10https://gerrit.wikimedia.org/r/242210 [18:03:24] (03PS1) 10Jcrespo: Adding wikiuser and wikiadmin grants to read the heartbeat table [puppet] - 10https://gerrit.wikimedia.org/r/242211 [18:05:31] (03PS1) 10Cmjohnson: DNS entries for elastic1006 and elastic1031 --swapping rows [dns] - 10https://gerrit.wikimedia.org/r/242212 [18:05:57] I'm getting "502 Bad Gateway nginx" from https://translatewiki.net , is it just me? (Is that server even operated by WMF?) [18:06:27] (03PS1) 10Ottomata: Apparently dirname isn't available, set $vardir and ensure it exists for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242213 (https://phabricator.wikimedia.org/T112688) [18:07:01] (03CR) 10jenkins-bot: [V: 04-1] Apparently dirname isn't available, set $vardir and ensure it exists for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242213 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [18:07:40] (03PS2) 10Ottomata: Apparently dirname isn't available, set $vardir and ensure it exists for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242213 (https://phabricator.wikimedia.org/T112688) [18:08:52] (03PS3) 10Ottomata: Apparently dirname isn't available, set $vardir and ensure it exists for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242213 (https://phabricator.wikimedia.org/T112688) [18:09:28] !log shutting down elastic1006 to relocate row/rack [18:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:54] 10Ops-Access-Requests, 
6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686141 (10RobH) @ZhouZ, Also, in attempting to dig up your wikitech credentials, I find your username on wikitech shell (labs) is zhousquared, is that correct? [18:10:22] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686149 (10RobH) @SLaporte is no stranger to Phabricator, typically he can simply comment directly on this task. [18:11:00] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: puppet fail [18:11:21] !log shutting down elastic1031 to relocate rack/row [18:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:48] (03CR) 10Cmjohnson: [C: 032] DNS entries for elastic1006 and elastic1031 --swapping rows [dns] - 10https://gerrit.wikimedia.org/r/242212 (owner: 10Cmjohnson) [18:13:40] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: puppet fail [18:14:05] (03CR) 10Ottomata: [C: 032] Apparently dirname isn't available, set $vardir and ensure it exists for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242213 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [18:14:46] (03PS1) 10RobH: shell/stat1002 access for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/242219 [18:15:52] (03CR) 10RobH: [C: 04-2] "My vote is technically +2, but this cannot merge until this upcoming Friday (due to 3 day wait of access requests for review.)" [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) (owner: 10John F. Lewis) [18:17:19] (03PS2) 10RobH: shell/stat1002 access for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/242219 [18:17:28] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686193 (10Slaporte) Approved. 
[18:17:44] (03CR) 10Aaron Schulz: [C: 031] Adding wikiuser and wikiadmin grants to read the heartbeat table [puppet] - 10https://gerrit.wikimedia.org/r/242211 (owner: 10Jcrespo) [18:18:55] (03CR) 10coren: [C: 031] "Big improvement in clarity." [puppet] - 10https://gerrit.wikimedia.org/r/242039 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [18:19:28] (03CR) 10RobH: [C: 031] "I'm not sure about the use of the A record rather than a cname. I think either will work." (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/242210 (owner: 10Andrew Bogott) [18:24:42] (03PS2) 10Andrew Bogott: Set up DNS for alternate labs DNS servers. [dns] - 10https://gerrit.wikimedia.org/r/242210 [18:25:12] (03CR) 10Andrew Bogott: [C: 032] Set up DNS for alternate labs DNS servers. [dns] - 10https://gerrit.wikimedia.org/r/242210 (owner: 10Andrew Bogott) [18:25:53] (03PS2) 10Yuvipanda: ldap: Rewrite ssh lookup script [puppet] - 10https://gerrit.wikimedia.org/r/242039 (https://phabricator.wikimedia.org/T114063) [18:26:18] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Rewrite ssh lookup script [puppet] - 10https://gerrit.wikimedia.org/r/242039 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [18:27:13] (03PS2) 10Andrew Bogott: Openstack: Remove service defs for services we do not want [puppet] - 10https://gerrit.wikimedia.org/r/242202 [18:27:15] (03PS1) 10Andrew Bogott: Associate the DNS servers on labsservices1001 with different IPs from those on holmium [puppet] - 10https://gerrit.wikimedia.org/r/242224 [18:28:04] (03PS2) 10Andrew Bogott: Associate the DNS servers on labsservices1001 with different IPs from those on holmium [puppet] - 10https://gerrit.wikimedia.org/r/242224 [18:28:39] (03CR) 10Andrew Bogott: [C: 032] Associate the DNS servers on labsservices1001 with different IPs from those on holmium [puppet] - 10https://gerrit.wikimedia.org/r/242224 (owner: 10Andrew Bogott) [18:28:51] (03PS3) 10Andrew Bogott: Openstack: Remove service defs for services we 
do not want [puppet] - 10https://gerrit.wikimedia.org/r/242202 [18:29:52] (03CR) 10Andrew Bogott: [C: 032] Openstack: Remove service defs for services we do not want [puppet] - 10https://gerrit.wikimedia.org/r/242202 (owner: 10Andrew Bogott) [18:30:34] 6operations, 7Database: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#1686251 (10jcrespo) Twice happened on frwikisource & enwikisource, table user_properties. [18:30:42] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686254 (10ZhouZ) @RobH Yes my instance shell account name is zhousquared my Wikitech username is still: ZZhou (WMF) [18:31:10] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: puppet fail [18:31:19] (03PS3) 10RobH: shell/stat1002 access for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/242219 [18:31:42] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1686268 (10ArielGlenn) snapshot conversion to hhvm is blocked on an upstream bug; see the snapshot task blocking tasks. once that's fixed and either backported... 
[18:32:35] (03PS2) 10Yuvipanda: tools: Remove ldapspportlib use from toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/242044 (https://phabricator.wikimedia.org/T114063) [18:32:49] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:33:23] (03PS1) 10Ottomata: Parameterize owner, group and modes for etcd::ssl::base files [puppet] - 10https://gerrit.wikimedia.org/r/242229 (https://phabricator.wikimedia.org/T112688) [18:33:25] (03CR) 10RobH: [C: 032] shell/stat1002 access for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/242219 (owner: 10RobH) [18:33:59] (03CR) 10jenkins-bot: [V: 04-1] Parameterize owner, group and modes for etcd::ssl::base files [puppet] - 10https://gerrit.wikimedia.org/r/242229 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [18:36:21] 6operations, 10ops-eqiad: RMA Samsung EVO ssds - https://phabricator.wikimedia.org/T107326#1686290 (10Cmjohnson) 5Open>3Resolved Dell is going to credit our account the amount paid for the ssds to be used for our next server purchases. [18:37:08] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Zhou Zhou - https://phabricator.wikimedia.org/T113325#1686292 (10RobH) 5Open>3Resolved a:3RobH Well this is over a week old so any objections had plenty of time. Since this isn't a sudo request, just a normal one, the 3 days will suff... [18:37:34] (03PS2) 10Ottomata: Parameterize owner, group and modes for etcd::ssl::base files [puppet] - 10https://gerrit.wikimedia.org/r/242229 (https://phabricator.wikimedia.org/T112688) [18:38:30] 6operations, 10ops-eqiad, 5Patch-For-Review: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1686300 (10Cmjohnson) Relocated elastic1006/1031 to appropriate racks, updated switch cfg, racktables. Corrected DNS and updated /... 
[18:39:13] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1686302 (10RobH) This seems identical to T113325, just for Dan instead. As such, would @Slaporte be the manager to approve this access expansion as well? Once we have manager approval, th... [18:39:26] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1686306 (10RobH) a:3RobH [18:39:53] (03PS3) 10ArielGlenn: dumps: convert tabs to spaces for worker.py and related modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241901 [18:43:57] (03PS1) 10RobH: setting up dan foy's shell access [puppet] - 10https://gerrit.wikimedia.org/r/242234 [18:44:20] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1686311 (10RobH) @DFoy, I've prepared the patchset for merge, I just need to have your manager's approval for this on task. [18:44:57] (03PS2) 10Dduvall: Checkout revs to repo-cache, link to repo [tools/scap] - 10https://gerrit.wikimedia.org/r/241684 (https://phabricator.wikimedia.org/T113107) (owner: 10Thcipriani) [18:45:02] (03CR) 1020after4: [C: 032] Update mediawiki version regex to support semantic version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228039 (https://phabricator.wikimedia.org/T67306) (owner: 1020after4) [18:45:17] apergos: what is the upstream bug? 
[18:45:18] (03PS3) 10Ottomata: Parameterize owner, group and modes for etcd::ssl::base files [puppet] - 10https://gerrit.wikimedia.org/r/242229 (https://phabricator.wikimedia.org/T112688) [18:45:24] (03CR) 10Ottomata: [C: 032 V: 032] Parameterize owner, group and modes for etcd::ssl::base files [puppet] - 10https://gerrit.wikimedia.org/r/242229 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [18:45:59] ori: see the blocking task on the c=snapshot conversion ticket [18:46:05] (03Merged) 10jenkins-bot: Update mediawiki version regex to support semantic version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228039 (https://phabricator.wikimedia.org/T67306) (owner: 1020after4) [18:46:14] but in a nutshell compress.bzip2 not implemented in their bz2, their mistake [18:46:31] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:46:45] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1686328 (10VBaranetsky) Thanks @coren. I wanted to know if I could also be granted access to Stats for the same purpose. [18:47:24] apergos: cool. looks like sarah golemon already has a patch up for review [18:49:22] (03CR) 10Dduvall: [C: 032] Checkout revs to repo-cache, link to repo [tools/scap] - 10https://gerrit.wikimedia.org/r/241684 (https://phabricator.wikimedia.org/T113107) (owner: 10Thcipriani) [18:49:30] yep, I have that bug open in the browser so I can see the updates as they come in [18:49:47] thanks [18:49:56] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1686362 (10Slaporte) >>! In T113324#1686302, @RobH wrote: > As such, would @Slaporte be the manager to approve this access expansion as well? It's not me -- let's check with @Dfoy. 
[18:49:57] (03Merged) 10jenkins-bot: Checkout revs to repo-cache, link to repo [tools/scap] - 10https://gerrit.wikimedia.org/r/241684 (https://phabricator.wikimedia.org/T113107) (owner: 10Thcipriani) [18:50:18] I"m mostly not here, and in a few minutes definitely not here, so any more news will have to wait for tomorrow [18:50:34] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1686365 (10RobH) a:5RobH>3DFoy @DFoy, I've assigned this task back to you, please have your manager attach approval. Once we have that, you can assign it back to me, thanks! [18:52:01] PROBLEM - MariaDB disk space on db2011 is CRITICAL: DISK CRITICAL - free space: /srv 83103 MB (5% inode=99%) [18:52:27] that's me ^ [18:52:30] ignore please [18:52:38] (03PS1) 10Rush: phab: mod_status for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/242236 [18:52:47] 10Ops-Access-Requests, 6operations: add spage to analytics-privatedata-users group for hive access - https://phabricator.wikimedia.org/T114150#1686374 (10RobH) @Spage, So should we remove you from statistics-private-data? Adding you to analytics-privatedata-users will require your manager's approval on task.... [18:53:10] (03PS1) 10Daniel Kinzler: Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/242237 [18:53:12] heh, i change back to us carriers [18:53:18] and now all my pages come from icinga as sender again [18:53:19] woot! 
[18:53:43] akosiaris: but it resulted in an excellent pager test for me so thank you =] (non sarcastic thank you) [18:55:28] (03PS5) 10Mforns: [WIP] Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) [18:55:37] (03CR) 10Eevans: [C: 031] cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [18:55:42] (03CR) 10Mforns: [WIP] Consume EventLogging validation logs from Logstash (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [18:55:53] 6operations, 10ops-eqiad: label wmf4575 as labservices1001 - https://phabricator.wikimedia.org/T114158#1686404 (10RobH) 3NEW a:3Cmjohnson [18:57:31] (03CR) 10Paladox: [C: 031] Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [18:57:51] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: convert tabs to spaces for worker.py and related modules [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/241901 (owner: 10ArielGlenn) [18:58:42] (03CR) 10Mforns: [C: 04-1] "Still WIP" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [19:00:05] bd808: Dear anthropoid, the time has come. Please deploy Update production grant review app with latest code and schema (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150929T1900). 
[19:00:24] (03PS1) 1020after4: symlinks for 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242240 [19:00:27] (03PS1) 10ArielGlenn: dumps: pylint cleanup, remove padding inside parens, square brackets [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242241 [19:00:40] (03CR) 1020after4: [C: 032] symlinks for 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242240 (owner: 1020after4) [19:00:47] (03Merged) 10jenkins-bot: symlinks for 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242240 (owner: 1020after4) [19:01:52] RECOVERY - MariaDB disk space on db2011 is OK: DISK OK [19:02:09] !log twentyafterfour@tin Started scap: sync php-1.27.0-wmf.1 for validation on testwiki [19:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:46] woo, semantic versioning without major explosions, at least none so far [19:03:02] twentyafterfour: A Wikidata branch went missing :P [19:03:21] what branch went missing? [19:03:37] The wmf/1.27.0-wmf.1 branch I created yesterday got deleted [19:03:58] I re-created it now (using the commit hash from yesterday) and will manually update the submodule [19:04:31] I thought I was supposed to use wmf22 [19:05:12] the deleted branch was my doing, I thought it had been created accidentally by make-wmf-branch [19:05:12] No, only use the last branch, if we don't branch ourselves [19:05:18] oh [19:05:22] :P [19:05:42] (03PS2) 10Rush: phab: mod_status for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/242236 [19:07:00] twentyafterfour: The idea is to let us branch biweekly (or less often, if needed) [19:07:04] (03CR) 10Rush: [C: 032] phab: mod_status for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/242236 (owner: 10Rush) [19:07:09] hoo: I created https://phabricator.wikimedia.org/diffusion/MREL/browse/master/bin/git-current-branch ... 
it should find the right branch to use [19:07:14] twentyafterfour, hoo: the intent of the config change the the make branch config was that it should copy the old branch to the new branch if the new branch does not already exist [19:07:15] Yet it would select the latest branch on its own [19:07:46] twentyafterfour: https://gerrit.wikimedia.org/r/242242 [19:07:54] but the feature to 'keep previous branch' in make-wmf-branch is broken. I'm trying to fix it to use that script I linked above ^ [19:08:26] Ok, cool [19:08:58] jzerebecki: it's unfortunately not working. make-wmf-branch did support that at one point but apparently not when branching from master. That was some really old cruft in the script that doesn't work as advertised. But I'm fixing it :) [19:09:05] 6operations, 6Phabricator, 6Project-Creators: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1686484 (10Aklapper) [19:10:11] hoo: The desire is to continue using whatever branch you create, right? make-wmf-branch should never create a new wikidata branch on it's own, just always use whatever is newest each tuesday [19:10:23] Exactly :) [19:10:33] got it [19:11:09] as long as sort -V works we're good. my rudimentary testing seems to confirm that it does work the way we need it to [19:11:52] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [19:13:15] hrmm.. icinga still broken.. didnt somebody ask earlier? 
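Note on the "sort -V" remark above: the plan is that make-wmf-branch picks whichever Wikidata branch is newest, by version order rather than plain string order. A minimal sketch of that ordering in Python follows. It only approximates what sort -V does, it is not the actual git-current-branch script, and the function names and the older-style branch name "wmf/1.26wmf22" are assumptions made for illustration.

```python
import re


def version_key(name):
    """Build a sort key in which runs of digits compare as numbers.

    Each chunk becomes a (kind, value) pair so that integers are never
    compared directly against strings.
    """
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", name):
        if chunk[0] in "0123456789":
            key.append((0, int(chunk)))
        else:
            key.append((1, chunk))
    return key


def latest_branch(branches):
    """Return the branch name that sorts last in version order."""
    return max(branches, key=version_key)


if __name__ == "__main__":
    # Hypothetical branch list; only wmf/1.27.0-wmf.1 is named in the log.
    branches = ["wmf/1.26wmf22", "wmf/1.27.0-wmf.1"]
    print(latest_branch(branches))
```

A plain string sort would place "wmf.10" before "wmf.2"; comparing the digit runs as integers avoids that, which is the property being relied on here.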
[19:13:24] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: pylint cleanup, remove padding inside parens, square brackets [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242241 (owner: 10ArielGlenn) [19:13:28] twentyafterfour, hoo: no when the branch doesn't already exists it should copy from the previous one instead of master [19:13:51] (03PS1) 10Andrew Bogott: Add db grants for the new designate/pdns server, labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) [19:14:49] mutante: it was something of otto's I think? at the time [19:15:24] (03PS1) 10ArielGlenn: dumps: pylint, no spaces for keywd assign, no trailing spaces [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242245 [19:16:02] jzerebecki: that would cause a proliferation of branches (which we already have everywhere else but I wish we didn't) [19:17:27] (03PS1) 10Giuseppe Lavagetto: etcd: fixup for preceding changes [puppet] - 10https://gerrit.wikimedia.org/r/242246 [19:17:40] (03PS1) 10Yuvipanda: admin: add chedasaurus to bastionly group as well [puppet] - 10https://gerrit.wikimedia.org/r/242247 [19:17:56] (03PS2) 10Yuvipanda: admin: add chedasaurus to bastionly group as well [puppet] - 10https://gerrit.wikimedia.org/r/242247 [19:17:57] twentyafterfour: i just want to make wikidata more like every other extension, having a deploy branch for every core deploy branch is a step. so if you change everything else to not proliferate... we could do that instead. 
[19:19:42] (03CR) 10Yuvipanda: [C: 032 V: 032] admin: add chedasaurus to bastionly group as well [puppet] - 10https://gerrit.wikimedia.org/r/242247 (owner: 10Yuvipanda) [19:20:36] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: pylint, no spaces for keywd assign, no trailing spaces [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242245 (owner: 10ArielGlenn) [19:21:25] (03CR) 10Jcrespo: "Sorry to be picky, but can you put a small comment (--) with the name of each machine (later it will be easier for both of us to sp" [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) (owner: 10Andrew Bogott) [19:21:47] (03PS2) 10Giuseppe Lavagetto: etcd: fixup for preceding changes [puppet] - 10https://gerrit.wikimedia.org/r/242246 [19:21:51] jzerebecki: I'm not too concerned either way but shouldn't we have the consensus of the wikidata team to make such a change? If you have already came to that decision then I'm fine with it, but make-wmf-branch is very inflexible and hard to fix so I need to know which way it should behave in order to fix the script [19:22:21] that is to say, I'd rather not fix it twice ;) [19:23:34] (03PS3) 10Giuseppe Lavagetto: etcd: fixup for preceding changes [puppet] - 10https://gerrit.wikimedia.org/r/242246 [19:23:52] <_joe_> ottomata: I'm merging this now ^^ [19:24:17] <_joe_> should be a noop on the etcd servers, if it breaks the eventlogging ones you can fix it ;) [19:24:52] 6operations: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#1686574 (10yuvipanda) 3NEW [19:26:01] twentyafterfour: that is the consens. I just rechecked with key people to make sure. 
[19:26:26] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: fixup for preceding changes [puppet] - 10https://gerrit.wikimedia.org/r/242246 (owner: 10Giuseppe Lavagetto) [19:26:37] (03CR) 10Giuseppe Lavagetto: [V: 032] etcd: fixup for preceding changes [puppet] - 10https://gerrit.wikimedia.org/r/242246 (owner: 10Giuseppe Lavagetto) [19:26:38] _joe_: looking! [19:27:09] _joe_: should this be [19:27:10] https://${etcd_hosts}?cert=/var/lib/puppet/ssl/certs/ca.pem" [19:27:17] https://${etcd_hosts}?cert=/etc/eventlogging.d/ssl/certs/ca.pem" [19:27:18] ? [19:27:34] joal: this isn't going to work [19:27:36] etcd user doesn't exist [19:27:53] OH [19:27:54] sorry. [19:27:55] got it. [19:28:01] you just moved it out of the class, and aren't including it anymore [19:28:01] (03PS2) 10Andrew Bogott: Add db grants for the new designate/pdns server, labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) [19:28:02] oops [19:28:08] s/joal/_joe_ [19:28:13] PROBLEM - puppet last run on mw2032 is CRITICAL: CRITICAL: Puppet has 1 failures [19:28:14] <_joe_> ottomata: yep [19:28:24] OooK [19:28:25] :/ [19:28:34] <_joe_> ottomata: and next thing I wanna do is include the puppet CA in our trusted stores from the system [19:29:15] aye [19:30:38] (03PS3) 10Jcrespo: Add db grants for the new designate/pdns server, labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) (owner: 10Andrew Bogott) [19:30:55] !log Updated iegreview.wikimedia.org to c3ac5e6 (Update to Twig 1.20.0) and applied latest schema changes [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:16] (03PS1) 10Giuseppe Lavagetto: etcd: fixup path [puppet] - 10https://gerrit.wikimedia.org/r/242250 [19:31:22] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:32] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [19:31:34] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] etcd: fixup path [puppet] - 10https://gerrit.wikimedia.org/r/242250 (owner: 10Giuseppe Lavagetto) [19:31:41] (03PS1) 10Ottomata: Fix eventlogging ssl paths for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242251 (https://phabricator.wikimedia.org/T112688) [19:32:05] (03PS2) 10Ottomata: Fix eventlogging ssl paths for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242251 (https://phabricator.wikimedia.org/T112688) [19:32:29] (03PS4) 10Jcrespo: Add db grants for the new designate/pdns server, labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) (owner: 10Andrew Bogott) [19:33:34] (03CR) 10Jcrespo: [C: 032] Add db grants for the new designate/pdns server, labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/242244 (https://phabricator.wikimedia.org/T114159) (owner: 10Andrew Bogott) [19:33:46] (03PS3) 10Ottomata: Fix eventlogging ssl paths for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242251 (https://phabricator.wikimedia.org/T112688) [19:33:52] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:33:58] !log twentyafterfour@tin Finished scap: sync php-1.27.0-wmf.1 for validation on testwiki (duration: 31m 49s) [19:34:00] (03CR) 10Ottomata: [C: 032 V: 032] Fix eventlogging ssl paths for etcd [puppet] - 10https://gerrit.wikimedia.org/r/242251 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [19:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:14] jynus: ok to merge? [19:34:23] was about to ask the same thing [19:34:30] was on that screen too [19:34:32] so yes [19:34:38] k done [19:35:12] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: puppet fail [19:35:52] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: puppet fail [19:36:09] ^related to your change, otto? 
[19:36:51] (I do not mind, I am asking to discard it wasn't mine) [19:37:03] <_joe_> jynus: nope, related to mine, it is working now [19:37:12] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:37:21] :-) [19:37:39] <_joe_> ;) [19:37:43] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:38:40] hoo: Do you want me to update the wikidata submodule now before syncing to group0? [19:39:16] twentyafterfour: Yes, please [19:39:46] _joe_: can I ask for some help in betalabs with getting this ssl stuff to work? I think everything is properly set now. [19:39:55] but i currently can't connect. [19:40:03] 2015-09-29 19:39:12,777 (MainThread) Failed to get list of machines from https://deployment-conf03.deployment-prep.eqiad.wmflabs:2379/v2: SSLError(SSLError(336265225, '_ssl.c:355: error:140B0009:SSL routines:SSL_CTX_use_PrivateKey_file:PEM lib'),) [19:40:27] my etcd server settings are here [19:40:30] https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep [19:40:36] "etcd::host": deployment-conf03.deployment-prep.eqiad.wmflabs [19:40:36] "etcd::use_ssl": true [19:40:36] "etcd::use_client_certs": true [19:40:36] "etcd::peers_list": deployment-conf03=http://deployment-conf03.deployment-prep.eqiad.wmflabs:2380 [19:40:36] etcd_hosts: [19:40:36] - deployment-conf03.deployment-prep.eqiad.wmflabs [19:40:48] OH, http:// the problem there maybe? [19:41:08] should be "etcd::peers_list": deployment-conf03=https://deployment-conf03.deployment-prep.eqiad.wmflabs:2380 [19:41:08] ? [19:43:58] (03PS1) 10Yuvipanda: bastion: Use role based hiera for bastions [puppet] - 10https://gerrit.wikimedia.org/r/242254 [19:44:56] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1686641 (10Dzahn) 5Open>3Resolved Hi @tli this is resolved now. 
The redirect changed and we also removed the old link from our caching servers. Some users might still... [19:45:17] (03CR) 10Yuvipanda: [C: 032] bastion: Use role based hiera for bastions [puppet] - 10https://gerrit.wikimedia.org/r/242254 (owner: 10Yuvipanda) [19:45:55] (03CR) 10Alex Monk: "Also needs to do hieradata/hosts/hooft.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/242254 (owner: 10Yuvipanda) [19:46:51] mutante: https://gerrit.wikimedia.org/r/242254 [19:47:00] <_joe_> ottomata: you can't connect to what? [19:47:11] _joe_: the etcd client can't connect using the cert i gave it [19:47:35] <_joe_> ottomata: did you set up the etcd cluster? [19:48:00] <_joe_> and no, the peer list could be omitted in this case since it's just one host [19:48:11] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures [19:48:18] <_joe_> etc::use_client_certs should be set to false though [19:49:30] HMM, ok. trying. [19:49:50] ah no [19:49:52] _joe_: i had to set that [19:50:04] because, i need it to listen on interface, not on localhost [19:50:06] so [19:50:09] "etcd::host": deployment-conf03.deployment-prep.eqiad.wmflabs [19:50:17] and if I set that, and it doesn't know what that is in peers list [19:50:20] it complains [19:50:23] <_joe_> ottomata, that is ok [19:50:27] pretty sure, ok, trying. [19:50:40] <_joe_> but the second thing, etc::use_client_certs should be set to false though still stands [19:50:46] k [19:50:49] shoudl I leave peers list there? [19:50:56] yeah Error: Could not retrieve catalog from remote server: Error 400 on SERVER: We need either the domain name for DNS discovery or an explicit peers list at /etc/puppet/modules/etcd/manifests/init.pp:56 on node deployment-conf03.deployment-prep.eqiad.wmflabs [19:51:10] (03PS1) 10Yuvipanda: bastion: Remove host-specific hiera data for hooft too [puppet] - 10https://gerrit.wikimedia.org/r/242263 [19:51:11] <_joe_> yes, leave the peers list there [19:51:17] https? 
[19:51:21] <_joe_> nope [19:51:24] k [19:51:25] <_joe_> http is ok [19:51:28] (03CR) 10Yuvipanda: [C: 032 V: 032] bastion: Remove host-specific hiera data for hooft too [puppet] - 10https://gerrit.wikimedia.org/r/242263 (owner: 10Yuvipanda) [19:51:29] Krenair: fixed [19:51:37] <_joe_> that's the list for intra-cluster communications [19:51:44] <_joe_> not for clients to connect to [19:51:53] Krenair: can you rebase your patch based on master now? and use the role based ones [19:52:01] <_joe_> ottomata: also, yuvipanda might be able to help you [19:52:09] yuvipanda, you mean on top of production? :) [19:52:11] <_joe_> it's a bit late here [19:52:50] Krenair: yes :) [19:52:56] _joe_: yes I'm sitting across him :) [19:52:58] I can help! [19:53:11] <_joe_> yuvipanda: thanks [19:53:20] oook, still upset, will work with yuvi [19:53:22] RECOVERY - puppet last run on mw2032 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:53:25] thanks _joe_ [19:53:36] <_joe_> ottomata: I would've loved being more helpful [19:53:55] <_joe_> but now I gotta watch a movie where SF is blown to pieces by an earthquake! [19:54:17] _joe_: lol, Dwayne "the rock" :) [19:54:40] <_joe_> mutante: actually, I'm gonna watch the other San andreas movie that came out just before [19:54:45] <_joe_> the cheap knock-off [19:54:59] oh, i didnt know there were 2 [19:55:32] haha [19:55:41] (03PS1) 10Andrew Bogott: Typo fix! labsservices1001 is at .117 not at .17 [puppet] - 10https://gerrit.wikimedia.org/r/242266 [19:55:46] (03CR) 10jenkins-bot: [V: 04-1] Typo fix! 
labsservices1001 is at .117 not at .17 [puppet] - 10https://gerrit.wikimedia.org/r/242266 (owner: 10Andrew Bogott) [19:56:00] <_joe_> mutante: "san andreas quake" vs "san andreas" [19:56:44] <_joe_> also, I think the rock is less horrible as an actor than most think [19:57:07] i just saw the one with Dwayne Johnson and thought it was bad, but this one you mention: [19:57:10] Ratings: 1.9/10 from 5,063 users [19:57:11] heh [19:57:31] <_joe_> mutante: yeah, I want to get to the bottom of it :P [19:57:34] :) [19:58:33] (03PS2) 10Andrew Bogott: Typo fix! labsservices1001 is at .117 not at .17 [puppet] - 10https://gerrit.wikimedia.org/r/242266 [20:01:51] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [20:01:54] (03CR) 10Yuvipanda: "Ok, so I've merged mutante's patch that splits the ops role out separately and moved the config to use role-based hiera." [puppet] - 10https://gerrit.wikimedia.org/r/227327 (owner: 10Alex Monk) [20:02:22] (03PS1) 10Ottomata: Use ca_cert for etcd client in eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/242270 (https://phabricator.wikimedia.org/T112688) [20:02:29] (03CR) 10jenkins-bot: [V: 04-1] Use ca_cert for etcd client in eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/242270 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [20:02:38] (03PS2) 10Ottomata: Use ca_cert for etcd client in eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/242270 (https://phabricator.wikimedia.org/T112688) [20:02:56] (03CR) 10Ottomata: [C: 032 V: 032] Use ca_cert for etcd client in eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/242270 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [20:03:42] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1564 bytes in 0.006 second response time [20:04:54] (03PS3) 10Andrew Bogott: Typo fix! 
labsservices1001 is at .117 not at .17 [puppet] - 10https://gerrit.wikimedia.org/r/242266 [20:05:01] _joe_: yuvipanda it works! thank you (wrong python client param, was specifying client cert, not ca_cert) [20:05:18] <_joe_> ah! [20:05:30] <_joe_> ottomata: I didn't check that, I warned ya :P [20:05:51] <_joe_> ok, off [20:06:00] laters! ty! [20:06:17] (03PS1) 10coren: Create toolserver_legacy module [puppet] - 10https://gerrit.wikimedia.org/r/242288 (https://phabricator.wikimedia.org/T114102) [20:06:19] (03CR) 10Andrew Bogott: [C: 032] Typo fix! labsservices1001 is at .117 not at .17 [puppet] - 10https://gerrit.wikimedia.org/r/242266 (owner: 10Andrew Bogott) [20:06:30] yuvipanda: This one should make you happy ^^ [20:07:00] (03CR) 10Yuvipanda: "Do not know much about the config but modules make me happy." [puppet] - 10https://gerrit.wikimedia.org/r/242288 (https://phabricator.wikimedia.org/T114102) (owner: 10coren) [20:07:21] RECOVERY - RAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [20:07:57] what's the best way to run nrpe commands from scap? Should we query the nrpe daemon or should we ssh into the targets and run the checks directly? [20:09:59] (03PS2) 10coren: Create toolserver_legacy module [puppet] - 10https://gerrit.wikimedia.org/r/242288 (https://phabricator.wikimedia.org/T114102) [20:10:06] _joe_: ^ what do you think would be best? [20:10:26] oh I guess _joe_'s gone [20:11:21] (03CR) 10coren: [C: 032] "Mostly config move to module; and the substantive change can only break the host currently being fiddled with." [puppet] - 10https://gerrit.wikimedia.org/r/242288 (https://phabricator.wikimedia.org/T114102) (owner: 10coren) [20:12:11] Oh blah. What is palladium's new name again? [20:12:35] ? [20:12:40] still palladium [20:13:02] ... I thought it changed recently. Hm. Means something is broken because I can't go there anymore. 
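Note on the fix reported at 20:05:01 (client cert passed where the CA cert was wanted): a rough sketch of the distinction, assuming the python-etcd Client takes "ca_cert" for the CA bundle used to verify the server and "cert" for an optional client certificate. The helper name is hypothetical, and the host and path are copied from the discussion above rather than confirmed as the final values.

```python
def etcd_client_kwargs(host, port, ca_cert, client_cert=None):
    """Assemble keyword arguments for an HTTPS etcd client.

    ca_cert verifies the server's certificate. client_cert is added only
    when the server actually asks clients to present one.
    """
    kwargs = {
        "host": host,
        "port": port,
        "protocol": "https",
        "ca_cert": ca_cert,
    }
    if client_cert is not None:
        kwargs["cert"] = client_cert
    return kwargs


if __name__ == "__main__":
    kwargs = etcd_client_kwargs(
        "deployment-conf03.deployment-prep.eqiad.wmflabs",
        2379,
        "/etc/eventlogging.d/ssl/certs/ca.pem",  # path as floated in the log
    )
    print(kwargs)
    # With python-etcd installed, this would be passed on as
    # etcd.Client(**kwargs); that call is left out to keep the sketch
    # free of third-party imports.
```

With etcd::use_client_certs set to false, as _joe_ suggests, only the CA path is needed and no "cert" entry is produced.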
[20:13:12] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:13:31] Oh duh, nevermind - it's my client that's broken. [20:13:38] * Coren needs moar cafeine. [20:16:37] 6operations, 10ops-eqiad: Failed disk analytics1021 Kafka Broker - https://phabricator.wikimedia.org/T109832#1686864 (10Cmjohnson) fedex tracking number 9611918239302649959580 [20:16:59] 6operations, 10ops-eqiad: label wmf4575 as labservices1001 - https://phabricator.wikimedia.org/T114158#1686867 (10Cmjohnson) 5Open>3Resolved Done [20:17:01] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1686869 (10Cmjohnson) [20:17:46] 6operations, 10netops: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1686872 (10Cmjohnson) Connected at the demarc label 20045997 and connected mr1 ge-0/0/5 [20:26:08] Krinkle: bast4001 (ulsfo) is also available now for devs [20:26:14] yay [20:26:49] groups are now in the roles [20:27:03] and we have 2 roles, "regular" bastion and ops bastion (iron) [20:27:14] yuvi merged 2 changes [20:29:26] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/240016/ has been merged, which separated the roles into regular bastion and ops bastion" [puppet] - 10https://gerrit.wikimedia.org/r/239023 (owner: 10Dzahn) [20:30:25] (03Abandoned) 10Dzahn: let non-root users also use bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/239023 (owner: 10Dzahn) [20:33:44] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1686928 (10bd808) >>! In T110474#1578555, @Krenair wrote: > And from puppet: > manifests/role/iegreview.pp:... 
[20:37:24] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1686950 (10Tgr) >>! In T114078#1685110, @Nuria wrote: > BTW, cc @trg as he had a ticket on t... [20:37:38] (03PS10) 10Alex Monk: Add all groups to general bastions, mostly empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 [20:41:22] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 98 % full [20:42:25] damn, that's me again, it was fine all the time [20:43:11] doesnt keep me from connecting though [20:43:59] 6operations, 7Mail: Remove Aliases in Exim Mail Routing Config [SemiUrgent] - https://phabricator.wikimedia.org/T114173#1686971 (10JKrauska) 3NEW [20:44:17] 6operations, 7Mail: Remove Aliases in Exim Mail Routing Config [SemiUrgent] - https://phabricator.wikimedia.org/T114173#1686978 (10JKrauska) [20:45:42] (03CR) 10GWicke: cassandra: WIP support for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [20:46:04] ACKNOWLEDGEMENT - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 99 % full daniel_zahn scan [20:47:38] ottomata: regarding aqs, I think pretty much everything is in place, aside from the actual LVS implementation which I can do tomorrow, assuming boxes are installed, working fine and the service on every one of them is operational [20:50:07] (03PS1) 10Ottomata: eventlogging now requires python-etcd [puppet] - 10https://gerrit.wikimedia.org/r/242338 (https://phabricator.wikimedia.org/T112688) [20:50:10] cool., pach is merged? [20:50:12] akosiaris: ? 
[20:50:25] boxes are ready [20:51:00] (03CR) 10Ottomata: [C: 032] eventlogging now requires python-etcd [puppet] - 10https://gerrit.wikimedia.org/r/242338 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [20:51:30] hey, so who worked on labs-ns2 recently [20:52:52] twentyafterfour: Is the train delayed? testwiki is on 1.27 but mediawikiwiki isn't yet [20:53:01] * RoanKattouw is investigating a possible regression [20:53:45] mutante, andrewbogott or yuvipanda I'd imagine? [20:54:24] mutante: I’m working on it right now — what’s up? [20:56:00] andrewbogott: it breaks the icinga due to some duplicate definition [20:56:10] Error: Could not add object property in file '/etc/icinga/puppet_hosts.cfg' on line 6052. [20:56:11] mutante: still? I fixed that an hour ago [20:56:23] Warning: Duplicate definition found for host 'labs-ns2.wikimedia.org' [20:56:28] yea, maybe it came back after puppet run? [20:56:33] (03Restored) 10GWicke: Don't require nodejs for restbase [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [20:56:35] (03CR) 10BryanDavis: [WIP] Consume EventLogging validation logs from Logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [20:56:36] ok, I will look in a moment [20:56:57] andrewbogott: thanks [20:57:55] (03CR) 10GWicke: "To clarify, it is an earlier version of what became node 4.x. The advantage of using the iojs packages we already have for testing is that" [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [21:00:14] (03CR) 10GWicke: "@akosiaris, we'd still like to test this. Could you reconsider your stance, or propose an alternative puppet change that would let us test" [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [21:00:27] godog: yt? [21:00:31] q about our python-etcd package [21:01:20] (03PS1) 10Andrew Bogott: Move labs-recursor1 to 208.80.155.118. With luck that's in the right vlan. 
[dns] - 10https://gerrit.wikimedia.org/r/242345 [21:01:53] (03CR) 10Andrew Bogott: [C: 032] Move labs-recursor1 to 208.80.155.118. With luck that's in the right vlan. [dns] - 10https://gerrit.wikimedia.org/r/242345 (owner: 10Andrew Bogott) [21:18:34] mutante: puppet run is clean on neon… is this something intermittent, or am I misunderstanding the issue? [21:20:03] RECOVERY - Check size of conntrack table on iron is OK: OK: nf_conntrack is 47 % full [21:20:15] andrewbogott: the issue is that icinga service cant be restarted, so while a puppet run finishes we can't add new things to it or restart it [21:20:22] ottomata: ish, what's up? [21:20:27] there is this separate meta check for it [21:20:32] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=neon&service=Check+correctness+of+the+icinga+configuration [21:21:09] andrewbogott: root@neon:~# icinga -v /etc/icinga/icinga.cfg [21:21:19] try that, and you will see a config error [21:21:31] or try restarting the icinga service and it will try that too [21:23:04] puppet does not when it detects the config is broken [21:24:27] andrewbogott: for now i'd say let's remove the duplicate host manually from the config file, start the service again, then run puppet and see if the issue is back [21:27:49] godog: maybe this is not you, but setup.py in our package says version = '0.3.3' [21:28:00] but changelog says 0.4.0... [21:28:32] so, i'm thinking about rebuilding the package [21:28:43] the 0.4.0 tag in upstream has the proper setup.py version [21:29:00] can I just tag upstream/0.4.0 and then update changelog? [21:29:20] i'm not sure why our versions are like 0.4.0~git20150609+ac25bd7ba2-1 [21:29:29] perhaps they were built ona 0.4.0 dev branch instead of the tag? 
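Note on the icinga failure discussed above ("Duplicate definition found for host 'labs-ns2.wikimedia.org'"): a minimal sketch of how such duplicates could be spotted in a generated hosts file before restarting the service. It assumes the usual `define host { ... host_name ... }` object syntax; the function name, the sample text and its addresses are made up for illustration and are not taken from the real /etc/icinga/puppet_hosts.cfg. `icinga -v /etc/icinga/icinga.cfg` remains the authoritative check.

```python
import re
from collections import Counter

# Matches "define host { ... }" but not "define hostgroup { ... }".
HOST_BLOCK = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL)
HOST_NAME = re.compile(r"^\s*host_name\s+(\S+)", re.MULTILINE)


def duplicate_hosts(config_text):
    """Return the host_name values that are defined more than once, sorted."""
    names = []
    for block in HOST_BLOCK.findall(config_text):
        match = HOST_NAME.search(block)
        if match:
            names.append(match.group(1))
    counts = Counter(names)
    return sorted(name for name, seen in counts.items() if seen > 1)


# Hypothetical sample input; the addresses are documentation placeholders.
SAMPLE = """
define host {
    host_name   labs-ns2.wikimedia.org
    address     192.0.2.10
}
define host {
    host_name   neon
    address     192.0.2.11
}
define host {
    host_name   labs-ns2.wikimedia.org
    address     192.0.2.12
}
"""

if __name__ == "__main__":
    print(duplicate_hosts(SAMPLE))
```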
[21:30:09] ah, am finding https://github.com/jplana/python-etcd/commit/0adaa9e0f477bc2914471a95f7aad13c075c258d [21:30:16] nm, godog, it is _joe_, not you [21:30:19] i think I can rebuild with 0.4.0 [21:30:21] ottomata: ah yeah I haven't built that [21:30:33] you are maintainer in control so I pinged you :) [21:31:00] yargg so many conflicts [21:32:13] <_joe_> Ottomata: which conflicts? [21:32:40] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [21:32:45] ottomata: hah, that version that is (I think!) [21:32:50] PROBLEM - Host 208.80.154.21 is DOWN: CRITICAL - Host Unreachable (208.80.154.21) [21:33:07] <_joe_> ottomata: so, we built a package from a git version later than the 0.3.3 release [21:33:17] <_joe_> that's why we used that version number [21:33:54] <_joe_> now I plan to package a new deb when I release 0.4.2 with DNS based discovery and authentication [21:34:09] _joe_: aye, but it makes python deps unhappy, because the package does not satisfy them. [21:34:21] _joe_: can I rebuild from 0.4.0 tag now? [21:34:22] <_joe_> which python deps? [21:34:25] eventlogging [21:34:33] <_joe_> ottomata: 0.4.1 at least [21:34:35] unless i guess i just say >=0.3.3 [21:34:40] _joe_: , ok [21:34:49] <_joe_> just say >0.3.3 [21:34:57] Hm. ok. yeah ok. [21:35:03] it'll work with that version? 
[21:35:29] <_joe_> yep [21:36:00] ok, that is the easiest thing, to do, thanks [21:36:18] (03PS1) 10Andrew Bogott: Replace hardcoded 'labs-ns2.wikimedia.org' with a hiera call [puppet] - 10https://gerrit.wikimedia.org/r/242358 [21:37:27] (03CR) 10jenkins-bot: [V: 04-1] Replace hardcoded 'labs-ns2.wikimedia.org' with a hiera call [puppet] - 10https://gerrit.wikimedia.org/r/242358 (owner: 10Andrew Bogott) [21:38:45] (03PS2) 10Andrew Bogott: Replace hardcoded 'labs-ns2.wikimedia.org' with a hiera call [puppet] - 10https://gerrit.wikimedia.org/r/242358 [21:39:52] (03CR) 10Andrew Bogott: [C: 032] Replace hardcoded 'labs-ns2.wikimedia.org' with a hiera call [puppet] - 10https://gerrit.wikimedia.org/r/242358 (owner: 10Andrew Bogott) [21:45:26] (03CR) 10GWicke: "Actually, now that 4.1 is available in unstable we could also test by manually installing backports of the node 4.1 packages on some nodes" [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [21:45:31] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 94 % full [21:45:48] (03PS2) 10Dzahn: ganeti, swift: fix 'variable not enclosed' [puppet] - 10https://gerrit.wikimedia.org/r/242056 [21:47:47] _joe_: do you run this python client in prod? [21:47:50] now i'm getting [21:47:58] Searching for pyOpenSSL>=0.14 [21:48:05] which is a python-etcd dep [21:48:13] and [21:48:23] apt-cache show python-openssl | grep Version [21:48:24] Version: 0.13-2ubuntu6 [21:48:25] (in trusty) [21:48:31] hm checking jessie [21:48:42] AH [21:48:47] that is the stickler. ok. [21:48:56] ignore me, apologies for the ping, i know what to do. [21:51:34] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 95 % full [21:52:38] 6operations, 7Mail: Remove Aliases in Exim Mail Routing Config [SemiUrgent] - https://phabricator.wikimedia.org/T114173#1687281 (10RobH) 5Open>3Resolved a:3RobH removed and committed change, as soon as mailserver puppet runs it'll be gone. 
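Note on the version mismatches above (setup.py reporting 0.3.3 against a 0.4.0 changelog, and python-openssl 0.13 in trusty against a pyOpenSSL>=0.14 requirement): a rough sketch of how such requirement checks behave. This is a simplified stand-in, not setuptools' actual resolver. It compares only the leading numeric part of a version and ignores Debian suffixes such as `~git...` or `-2ubuntu6`; the helper names are invented for the example.

```python
import re


def version_key(version):
    """Turn '0.13-2ubuntu6' into (0, 13); anything after the numeric part is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric version in %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))


OPERATORS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "==": lambda a, b: a == b,
}


def satisfies(installed, operator, required):
    """Check an installed version against a single 'operator required' specifier."""
    a, b = version_key(installed), version_key(required)
    # Pad with zeros so that 0.4 and 0.4.0 compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return OPERATORS[operator](a, b)


if __name__ == "__main__":
    # Metadata says 0.3.3, so a 0.4.0 requirement is not met.
    print(satisfies("0.3.3", ">=", "0.4.0"))
    # trusty's python-openssl against the pyOpenSSL requirement.
    print(satisfies("0.13-2ubuntu6", ">=", "0.14"))
```

One thing the sketch makes visible: if the installed metadata really reports exactly 0.3.3, `>=0.3.3` matches while a strict `>0.3.3` does not. The log does not show which form was finally used.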
[21:53:02] twentyafterfour: did you forget to run sync-wikiversions for group0? [21:53:31] mw.o is still on 1.26wmf24 [21:53:44] RoanKattouw is asking in -releng [21:54:04] <_joe_> ottomata: going to bed now, sorry [21:54:13] I asked here an hour ago as well [21:54:22] <_joe_> drop me an email if you need to [21:54:24] (03PS1) 10Jdlrobson: Add Extension:RelatedArticles to beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242362 (https://phabricator.wikimedia.org/T113770) [21:54:52] s'ok, thanks _joe_ [21:54:56] goodnight [21:54:59] I wonder if his network dropped or something. "has been idle for 1 hour, 44 minutes" [21:56:31] What's going on in s1? [21:56:46] enwiki just went into read-only mode [21:57:05] several DBs are lagged by a few minutes [21:57:33] jynus, ^ [21:58:34] RECOVERY - Check size of conntrack table on iron is OK: OK: nf_conntrack is 76 % full [21:58:40] Bleh, my fault. [21:58:42] ostriches, that's nothing to do with you, is it...? [21:58:42] Krenair: Huh? https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb looks fine [21:58:42] Should be back. [21:58:43] ah [21:59:21] RoanKattouw, yeah, they all just went back to 0 [21:59:32] Oh I guess I looked just to late [22:00:25] PROBLEM - MariaDB Slave Lag: s1 on db2042 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 459 [22:00:30] akosiaris: around? [22:00:38] No it isn't, get with the show icinga. [22:00:43] need some pdebuild help, trying to backport a package from jessie to trusty [22:02:14] RECOVERY - MariaDB Slave Lag: s1 on db2042 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [22:03:30] RoanKattouw, https://people.wikimedia.org/~krenair/s1_lag_2015-09-29_23-00.png [22:03:46] was what I saw earlier [22:04:01] Krenair: The problem with cleaning up user_preferences isn't the number of rows to update...it's the number of rows selected to look at for updating. [22:04:02] ottomata, ori: meeting? 
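Note on the recurring "Check size of conntrack table on iron" alerts: a small sketch of how a fill-percentage check of this kind could be computed. The /proc paths are the standard Linux netfilter counters; the warning and critical thresholds (80 and 90) are assumptions, since the log only shows CRITICAL at 94 % and above and OK at 76 % and below. The function names are hypothetical and this is not the actual plugin used on the host.

```python
COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def read_int(path):
    """Read a single integer from a /proc file."""
    with open(path) as handle:
        return int(handle.read().strip())


def conntrack_status(count, maximum, warn=80, crit=90):
    """Return (state, message) for a conntrack table holding `count` of `maximum` entries.

    The warn/crit thresholds are assumed values, not the production ones.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    percent = count * 100 // maximum
    if percent >= crit:
        state = "CRITICAL"
    elif percent >= warn:
        state = "WARNING"
    else:
        state = "OK"
    return state, "%s: nf_conntrack is %d %% full" % (state, percent)


if __name__ == "__main__":
    # On a Linux host with conntrack loaded, the live values would come from:
    #   conntrack_status(read_int(COUNT_PATH), read_int(MAX_PATH))
    print(conntrack_status(98, 100)[1])
    print(conntrack_status(47, 100)[1])
```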
[22:04:04] PROBLEM - Check size of conntrack table on iron is CRITICAL: CRITICAL: nf_conntrack is 100 % full [22:04:11] r53? [22:04:16] brt [22:04:24] here! [22:04:27] yep [22:04:50] Krenair: eg: https://phabricator.wikimedia.org/P2117 [22:05:27] * AaronSchulz is in the hangout [22:05:27] !log 22:03:51 Synchronized php-1.26wmf24/extensions/CentralNotice: 30bdfcb386: Updated mediawiki/core Project: mediawiki/extensions/CentralNotice 6bd658e155a02edb4cc506bc3494a3f4699d3e94 (duration: 00m 17s) [22:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:28] ostriches, I was cleaning out some preferences earlier and just used delete ... limit 500 in runBatchedQuery, which does wfWaitForSlaves [22:06:43] not quite sure what you were running though [22:06:47] ottomata: wait, are you there IRL? [22:06:57] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1687346 (10cscott) @bd808 the new hotness is to use the REST API's /transform/wikitext/to/html endpoint to do th... [22:07:23] Krenair: LIMIT on DELETE makes mysql yell at you. [22:07:55] Krenair: It wasn't even multiple queries, or a large batch. It was 16 rows. [22:08:12] oh, wow [22:08:51] Query OK, 16 rows affected (2 min 9.64 sec) <- From the master [22:09:11] Like I said, it's because we're selecting ~17m rows to look at for deleting. [22:09:39] would it have been quicker to select them all first and then delete the ones found? [22:09:48] with up_user etc. [22:10:55] Probably, lets see..... [22:10:56] RECOVERY - Check size of conntrack table on iron is OK: OK: nf_conntrack is 28 % full [22:11:14] you break it, you fix it [22:11:28] I did. [22:11:39] jynus: new catchphrase? 
;) [22:14:50] Yeah given that we use statement-based replication, it's probably faster to run those selects on a slave, then run a query on the master that just deletes those (up_user,up_property) pairs [22:15:12] And avoid having to look at up_value at all, indeed. [22:15:48] Yeah in the delete you wouldn't have to, that would just be PK-based [22:16:10] (03PS1) 10Rush: phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 [22:17:17] (03CR) 10jenkins-bot: [V: 04-1] phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 (owner: 10Rush) [22:17:17] (03PS2) 10Rush: phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 [22:17:17] (03CR) 10jenkins-bot: [V: 04-1] phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 (owner: 10Rush) [22:18:16] (03PS3) 10Rush: phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 [22:19:13] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: Puppet has 1 failures [22:22:36] (03PS1) 10Yuvipanda: dynamicproxy: Remove ::eqiad suffix [puppet] - 10https://gerrit.wikimedia.org/r/242371 [22:22:41] (03CR) 10Rush: [C: 032] phab: basic apache status metrics to graphite [puppet] - 10https://gerrit.wikimedia.org/r/242367 (owner: 10Rush) [22:23:04] ottomata: https://etherpad.wikimedia.org/p/KafkaPurge [22:23:23] !log unbanning elastic1006 for shard population [22:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:46] (03PS2) 10Yuvipanda: dynamicproxy: Remove ::eqiad suffix [puppet] - 10https://gerrit.wikimedia.org/r/242371 [22:24:10] anyone deploying right now? I need to finish pushing the new branch, it only made it to testwiki so far [22:24:52] bd808: You're the only one that has a window today, are you done? 
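Note on the user_properties cleanup discussed at 22:14: the suggestion was to run the expensive selects on a slave and then delete only the matching (up_user, up_property) pairs on the master. A minimal sketch of that two-step pattern follows. It uses an in-memory SQLite database as a stand-in for both servers so it can run anywhere; the property value, batch size and function names are illustrative assumptions, and a production run would also wait for replication between batches (wfWaitForSlaves, as mentioned above), which is only marked by a comment here.

```python
import sqlite3


def find_stale_keys(replica, prop, value):
    """Step 1, on a replica: the slow scan that has to look at up_value."""
    rows = replica.execute(
        "SELECT up_user, up_property FROM user_properties "
        "WHERE up_property = ? AND up_value = ?",
        (prop, value),
    )
    return rows.fetchall()


def delete_by_key(master, keys, batch_size=500):
    """Step 2, on the master: delete by primary key only, in small batches."""
    deleted = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        with master:  # commits each batch as its own transaction
            cursor = master.executemany(
                "DELETE FROM user_properties "
                "WHERE up_user = ? AND up_property = ?",
                batch,
            )
            deleted += cursor.rowcount
        # A production script would pause here until replicas have caught up.
    return deleted


def demo():
    """Run both steps against one in-memory database with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE user_properties ("
        "up_user INTEGER, up_property TEXT, up_value TEXT, "
        "PRIMARY KEY (up_user, up_property))"
    )
    db.executemany(
        "INSERT INTO user_properties VALUES (?, ?, ?)",
        [
            (1, "skin", "oldskin"),
            (2, "skin", "vector"),
            (3, "skin", "oldskin"),
            (3, "language", "de"),
        ],
    )
    db.commit()
    keys = find_stale_keys(db, "skin", "oldskin")
    deleted = delete_by_key(db, keys, batch_size=1)
    remaining = db.execute("SELECT COUNT(*) FROM user_properties").fetchone()[0]
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

The point of the split is that the statement replayed on every replica touches only rows addressed by key, so none of them has to repeat the wide scan that caused the lag.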
[22:25:15] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Remove ::eqiad suffix [puppet] - 10https://gerrit.wikimedia.org/r/242371 (owner: 10Yuvipanda) [22:26:37] doesn't seem to be anyone active on tin so I'm going for it [22:27:35] yeah. I finished a long time ago [22:28:03] (03CR) 10Tim Landscheidt: "I'm not sure if this is the reason, but if you look at "ldaplist -l hosts $hostname", for shinken-ircbot-testing role::labs::shinken is li" [puppet] - 10https://gerrit.wikimedia.org/r/241526 (owner: 10Alex Monk) [22:28:49] 22:27:17 sync-dir failed: /srv/mediawiki-staging/php-1.27.0-wmf.1/vendor/oyejorge/less.php/lib/Less/Version.php has content before opening (03PS11) 10Alex Monk: Add all groups to general bastions, mostly empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) [22:29:12] wth... it sync'd before. Does a full-scap not run that check? hmmm [22:29:15] (03CR) 10Alex Monk: [C: 04-1] "TODO: "Figure out a solution that doesn't grant people root on bastion just because they have root in other places!"" [puppet] - 10https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:29:31] no, it doesn't [22:29:56] !log twentyafterfour@tin Started scap: php-1.27.0-wmf.1/extensions/ [22:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:10] !log es-tool unban-node elastic1031 [22:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:33] !log 22:27:17 sync-dir failed: /srv/mediawiki-staging/php-1.27.0-wmf.1/vendor/oyejorge/less.php/lib/Less/Version.php has content before opening Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:12] 6operations, 10Dumps-Generation: sql dump schemata - seven tables should have their columns reordered - https://phabricator.wikimedia.org/T103583#1687508 (10wpmirrordev) For @jcrespo: 0) Context I am 
the author of https://www.mediawiki.org/wiki/Wp-mirror, which is a utility for building a mirror farm of wik... [22:34:25] !log twentyafterfour@tin Finished scap: php-1.27.0-wmf.1/extensions/ (duration: 04m 29s) [22:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:35] ottomata: is our puppet code for kafka fairly general for making new clusters like this or are there a bunch of use-case specific assumptions? [22:35:12] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242376 [22:35:44] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242376 (owner: 1020after4) [22:35:51] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242376 (owner: 1020after4) [22:36:08] 6operations, 5Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#1687516 (10Krenair) The main challenge with this is ensuring that while (almost?) all groups imply bastion access, only ops gets their sudo rules applied on bastions... [22:36:29] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.1 [22:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:49] !log finished deployment of 1.27.0-wmf.1 to group0 [22:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:20] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1687538 (10awight) For more context, we're sending data very occasionally, one message on fa... 
[22:40:35] (03PS1) 10ArielGlenn: dumps: pylint, remove superfluous parens, ensure space after commas [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242380 [22:41:30] Krenair, RoanKattouw: https://gerrit.wikimedia.org/r/#/c/242374/ should do it nicer. [22:42:10] what about the nostalgia skin on nostalgiawiki? [22:42:23] Doesn't matter what prefs people have set there. [22:42:39] does that wiki even allow login? [22:43:15] There are 6 local users there total [22:43:27] It's not sul'd lol [22:44:15] Almost all system accounts, with the exception of JeLuF, who was a shell user [22:44:39] twentyafterfour: we just no bumped that less.php library to a version that doesn't have the lint failure. Should be all better next week [22:44:41] 2 skin property rows, both set to vector as their skin. [22:44:48] s/no/now/ [22:44:51] 6operations, 10ops-eqiad, 5Patch-For-Review: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1687589 (10chasemp) a:3chasemp [22:45:23] (03PS3) 10Dzahn: ganeti, swift: fix 'variable not enclosed' [puppet] - 10https://gerrit.wikimedia.org/r/242056 [22:46:20] 10Ops-Access-Requests, 6operations: add spage to analytics-privatedata-users group for hive access - https://phabricator.wikimedia.org/T114150#1687599 (10Spage) >>! In T114150#1686374, @RobH wrote: > @Spage, > > So should we remove you from statistics-private-data? No, I need both. > Adding you to analytics-... [22:46:20] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:47:21] 10Ops-Access-Requests, 6operations: add spage to analytics-privatedata-users group for hive access - https://phabricator.wikimedia.org/T114150#1687601 (10JKatzWMF) approved! 
[22:49:53] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1687605 (10awight) Everything Jgreen is saying is true ;) there's no effect on existing banner impression counts... [22:51:38] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: pylint, remove superfluous parens, ensure space after commas [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242380 (owner: 10ArielGlenn) [22:52:40] so apparently my skin was reset to Vector (from MonoBook) on MW.org... [22:52:57] ostriches: Krenair ^ related? [22:53:23] (03PS4) 10Dzahn: ganeti, swift: fix 'variable not enclosed' [puppet] - 10https://gerrit.wikimedia.org/r/242056 [22:53:39] Erm, that's not right... [22:53:44] AaronSchulz: it should be pretty general [22:53:51] I shouldn't have touched any monobook rows... [22:53:54] AaronSchulz: I will write some docs right now for ya... [22:54:17] nice, I just filed the task [22:54:32] 6operations, 10MediaWiki-Cache, 6Performance-Team, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687627 (10aaron) [22:55:28] bd808: thanks [22:55:45] 6operations, 10MediaWiki-Cache, 6Performance-Team, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687628 (10aaron) [22:56:39] 6operations, 10MediaWiki-Cache, 6Performance-Team, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687618 (10aaron) [22:59:09] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/934/" [puppet] - 10https://gerrit.wikimedia.org/r/242056 (owner: 10Dzahn) [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150929T2300). [23:00:05] matt_flaschen RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:18] Here [23:00:37] I'll do it [23:01:45] (03PS2) 10Dzahn: lint: fix 'variable not enclosed' warnings [puppet] - 10https://gerrit.wikimedia.org/r/242055 [23:01:55] (03PS3) 10Dzahn: lint: fix 'variable not enclosed' warnings [puppet] - 10https://gerrit.wikimedia.org/r/242055 [23:02:20] (03PS2) 10Dzahn: lint: fix 'variable not enclosed' pt2 [puppet] - 10https://gerrit.wikimedia.org/r/242057 [23:05:41] AaronSchulz: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka [23:05:51] oops [23:05:53] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka#In_Labs [23:06:30] (03PS1) 10ArielGlenn: worker.py: fix indentation issues, many camelcase names [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/242396 [23:06:41] (03CR) 10Dzahn: "i would follow qchris' recommendation" [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [23:07:03] thanks [23:11:00] (03CR) 10Dzahn: "@csteipp this is the one we talked about" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [23:14:10] (03CR) 10Dzahn: "on hold since 2014, bump" [dns] - 10https://gerrit.wikimedia.org/r/120999 (owner: 10ArielGlenn) [23:16:13] (03CR) 10Dzahn: "only after all the existing issues have been fixed or the special comment has been added to disable the checks" [puppet] - 10https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: 10Hashar) [23:19:44] (03CR) 10Dzahn: "still needs another fix, see the error when starting ferm on mira" [puppet] - 10https://gerrit.wikimedia.org/r/240083 (owner: 10Muehlenhoff) [23:22:14] PROBLEM - puppet last run on db1071 is 
CRITICAL: CRITICAL: Puppet has 1 failures [23:24:36] (03PS1) 10Yuvipanda: dynamicproxy: Build proper invisible-unicorn package [puppet] - 10https://gerrit.wikimedia.org/r/242402 [23:25:24] (03PS2) 10Yuvipanda: dynamicproxy: Build proper invisible-unicorn package [puppet] - 10https://gerrit.wikimedia.org/r/242402 [23:25:24] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1687695 (10tli) Thank, Dzahn! [23:26:18] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Build proper invisible-unicorn package [puppet] - 10https://gerrit.wikimedia.org/r/242402 (owner: 10Yuvipanda) [23:29:19] (03PS1) 10Yuvipanda: dynamicproxy: Use new servicename for API [puppet] - 10https://gerrit.wikimedia.org/r/242404 [23:30:11] (03PS2) 10Yuvipanda: dynamicproxy: Use new servicename for API [puppet] - 10https://gerrit.wikimedia.org/r/242404 [23:30:21] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Use new servicename for API [puppet] - 10https://gerrit.wikimedia.org/r/242404 (owner: 10Yuvipanda) [23:30:24] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/GuidedTour/GuidedTourHooks.php: T114144 Fix back button logging (duration: 00m 17s) [23:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:31] matt_flaschen: ---^^ (sorry for the delay) [23:31:03] RoanKattouw, it's okay. Just had to be done before enabling opt-in. 
[23:31:54] 6operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1687700 (10Dzahn) [23:32:07] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/Echo/: SWAT (duration: 00m 17s) [23:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:09] !log catrope@tin Synchronized php-1.26wmf24/extensions/GuidedTour/GuidedTourHooks.php: T114144 Fix back button logging (duration: 00m 18s) [23:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:04] (03CR) 10Catrope: [C: 032] Enable Flow opt-in on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242035 (owner: 10Catrope) [23:36:31] (03Merged) 10jenkins-bot: Enable Flow opt-in on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242035 (owner: 10Catrope) [23:37:38] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow opt-in on mediawikiwiki (duration: 00m 16s) [23:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:23] (03PS5) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) [23:46:46] !log catrope@tin Synchronized php-1.26wmf24/extensions/Echo/: SWAT (duration: 00m 17s) [23:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:25] (03PS1) 10Ori.livneh: xenon-log: make retention count apply per-entrypoint [puppet] - 10https://gerrit.wikimedia.org/r/242411 [23:48:55] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:52:56] (03PS3) 10Dzahn: Modified redirects config concerning outreachwiki aliases [puppet] - 10https://gerrit.wikimedia.org/r/241564 (owner: 10Base) [23:55:06] (03CR) 10Dzahn: [C: 032] Modified redirects config concerning outreachwiki aliases [puppet] - 
10https://gerrit.wikimedia.org/r/241564 (owner: 10Base) [23:58:14] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [23:59:50] !log restarting eventlogging so that processors use etcd to pick up shared token with which to consistently hash IPs [23:59:53] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master