[00:00:04] <jouncebot>	 twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T0000). Please do the needful.
[00:08:34] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[00:08:53] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: Connection refused
[00:10:36] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[00:10:52] <urandom>	 on that ^^^
[00:11:59] <grrrit-wm>	 (03CR) 10Dzahn: "I have no idea, but somehow i doubt it will cleanup things, i'd expect more that it tries to pull from new repo on top of everything that " [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup)
[00:13:13] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[00:13:22] <wikibugs>	 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2600519 (10Dzahn) a:05Dzahn>03None
[00:13:43] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-03-01 14:11:15 +0000 (expires in 181 days)
[00:13:50] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600157 (10EBernhardson) I'm able to reproduce this fairly regularly from `nc`, although i need to...
[00:14:02] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.002 second response time on port 9042
[00:26:54] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 737 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5034673 keys - replication_delay is 737
[00:41:47] <wikibugs>	 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600579 (10AndyRussG) >>! In T143271#2599100, @Mattflaschen-WMF wrote: > If this affects windo...
[00:42:43] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5024883 keys - replication_delay is 25
[00:58:18] <grrrit-wm>	 (03PS1) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) 
[00:59:39] <grrrit-wm>	 (03CR) 1020after4: [C: 031] "As far as I can remember:" [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup)
[01:00:20] <Jeff_Green>	 !log reboot db1025 for kernel update
[01:00:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:02:07] <grrrit-wm>	 (03CR) 1020after4: "in case it isn't clear, this only affects initial deployment repo setup on the deployment hosts and it should not have any effect on the t" [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup)
[01:05:13] <icinga-wm>	 RECOVERY - check_recurring_gc_failures_missed on db1025 is OK: OK
[01:10:10] <grrrit-wm>	 (03PS2) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) 
[01:24:09] <grrrit-wm>	 (03PS1) 10Legoktm: Enable flake8 on Python 3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/307895 
[01:24:11] <grrrit-wm>	 (03PS1) 10Legoktm: Run tests on Python 3.4 and 3.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/307896 
[01:35:37] <grrrit-wm>	 (03PS9) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) 
[01:40:06] <icinga-wm>	 PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325]
[01:42:54] <grrrit-wm>	 (03PS1) 10Dzahn: Revert "archiva: migration class to rsync data to new host" [puppet] - 10https://gerrit.wikimedia.org/r/307900 
[01:43:19] <grrrit-wm>	 (03CR) 10Dzahn: "all is fine, this is just the reminder to remove temp setup after migration is done" [puppet] - 10https://gerrit.wikimedia.org/r/307900 (owner: 10Dzahn)
[01:45:07] <icinga-wm>	 PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325]
[01:47:09] <icinga-wm>	 ACKNOWLEDGEMENT - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=829 [critical =325] Jeff_Green known problem
[01:47:23] <grrrit-wm>	 (03PS1) 10Dzahn: wikistats: remove $realm check, simplify role [puppet] - 10https://gerrit.wikimedia.org/r/307902 
[01:49:26] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] wikistats: remove $realm check, simplify role [puppet] - 10https://gerrit.wikimedia.org/r/307902 (owner: 10Dzahn)
[02:13:45] <grrrit-wm>	 (03PS1) 10Dzahn: url_downloader: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/307905 
[02:15:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4004 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp4013.ulsfo.wmnet because of too many down!: uploadlb6_443 - Could not depool server cp4005.ulsfo.wmnet because of too many down!
[02:15:21] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4002 is CRITICAL: PYBAL CRITICAL - uploadlb6_443 - Could not depool server cp4006.ulsfo.wmnet because of too many down!
[02:18:03] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
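The alert above reports the share of recent datapoints that exceed a threshold. A minimal sketch of that kind of check, assuming a plain list of per-minute 5xx counts; the function name `percent_above` and the sample values are hypothetical, and this is not the actual Icinga or Graphite plugin code:

```python
from typing import Sequence


def percent_above(points: Sequence[float], threshold: float) -> float:
    """Return the percentage of datapoints strictly above the threshold."""
    if not points:
        return 0.0
    over = sum(1 for p in points if p > threshold)
    return 100.0 * over / len(points)


# Hypothetical sample: 10 per-minute readings, 2 of them above 1000.
sample = [120, 90, 300, 1500, 2200, 400, 80, 60, 75, 110]
print(percent_above(sample, 1000.0))  # 20.0
```

With a critical threshold of 1000.0, two of ten readings over the line gives the 20.00% figure of the kind shown in the alert text.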
[02:21:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4002 is OK: PYBAL OK - All pools are healthy
[02:22:00] <grrrit-wm>	 (03PS1) 10BBlack: depool ulsfo, some kind of issues... [dns] - 10https://gerrit.wikimedia.org/r/307906 
[02:22:12] <mutante>	 i saw the alerts, looked at cp4006 and 4007
[02:22:14] <mutante>	 because they are in icinga
[02:22:15] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] depool ulsfo, some kind of issues... [dns] - 10https://gerrit.wikimedia.org/r/307906 (owner: 10BBlack)
[02:22:22] <mutante>	 they look busy but running
[02:22:28] <bblack>	 !log ulsfo depool
[02:22:33] <mutante>	 thanks bblack
[02:22:34] <chasemp>	 bblack: thanks, was just hopping on
[02:22:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:22:54] <bblack>	 thanks for paging :)
[02:22:56] <mutante>	 it was trying to depool them but couldn't
[02:23:10] <robh>	 looks like pybal couldn't depool any more than it did
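The "could not depool ... because of too many down" behaviour discussed above can be pictured as a guard that refuses to remove a backend once too few would remain pooled. A minimal sketch, assuming a simple fraction-based rule; the function `can_depool` and its parameters are hypothetical illustrations, not PyBal's actual implementation:

```python
def can_depool(pooled: int, total: int, depool_threshold: float) -> bool:
    """Return True if removing one more server still leaves at least
    depool_threshold (a fraction of total) servers pooled.

    Hypothetical rule for illustration only.
    """
    return (pooled - 1) >= total * depool_threshold


# Hypothetical pool of 10 backends with a 0.5 threshold.
print(can_depool(pooled=10, total=10, depool_threshold=0.5))  # True
print(can_depool(pooled=5, total=10, depool_threshold=0.5))   # False
```

Under a rule like this, a load balancer keeps sending traffic to unhealthy backends rather than emptying the pool, which is consistent with the 5xx spike seen before the site was depooled in DNS.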
[02:23:13] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:23:33] <mutante>	 hmm.. back ?
[02:23:33] <bblack>	 I'm assuming the two likely underlying problems here are either the varnish4 install in ulsfo upload cluster, or the recent ulsfo link problems.
[02:23:41] <bblack>	 but either way, depooling ulsfo fixes it for the evening
[02:23:46] <chasemp>	 ok cool
[02:24:52] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4006 is CRITICAL: Connection timed out
[02:26:05] <bblack>	 26535 vcache    20   0  0.131t 0.080t  87376 S 888.3 43.5   4797:21 varnishd
[02:26:09] <bblack>	 27083 vcache    20   0  0.719t 0.086t 0.077t S 653.3 46.6  12498:07 varnishd
[02:26:16] <bblack>	 those 888 and 653 numbers are %cpu
[02:26:30] <bblack>	 loadavg on cp4006 is at like 66 right now, and I'm sure that's come down since the middle of the problem period
[02:26:34] <bblack>	 so, likely varnish4 issues
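The pasted `top` lines above can be read programmatically. A minimal sketch, assuming the default `top` column order (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND); the helper `parse_top_line` is hypothetical:

```python
def parse_top_line(line: str) -> dict:
    """Parse one process row of top output, assuming default columns."""
    fields = line.split()
    return {
        "pid": int(fields[0]),
        "user": fields[1],
        "cpu_pct": float(fields[8]),   # %CPU; 100 equals one fully busy core
        "mem_pct": float(fields[9]),   # %MEM
        "command": fields[11],
    }


line = "26535 vcache    20   0  0.131t 0.080t  87376 S 888.3 43.5   4797:21 varnishd"
proc = parse_top_line(line)
cores_busy = proc["cpu_pct"] / 100.0
print(proc["command"], proc["cpu_pct"], round(cores_busy, 1))  # varnishd 888.3 8.9
```

Because `top` reports %CPU per core, 888.3 corresponds to roughly nine cores' worth of CPU time for that one varnishd process.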
[02:27:28] <icinga-wm>	 RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.150 second response time
[02:27:48] <mutante>	 do you restart services when it happens?
[02:28:14] <bblack>	 it doesn't ever happen :)
[02:28:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4004 is OK: PYBAL OK - All pools are healthy
[02:28:24] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 921 bytes in 1.592 second response time
[02:28:31] <bblack>	 I'm leaving it for ema to do forensics on in the morning.  none of the daemons actually crashed
[02:28:31] <mutante>	 fair :)
[02:28:38] <mutante>	 yep
[02:30:15] <bblack>	 https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Upload+caches+ulsfo&m=cpu_report&s=by+name&mc=2&g=cpu_report
[02:30:26] <bblack>	 locked up in sys%, interesting
[02:30:33] <logmsgbot>	 !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 12m 21s)
[02:30:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:52] <bblack>	 may not be strictly v4's fault, though.  we also left some systemtap stuff running overnight to debug an issue.  it could be that stap itself caused a problem.
[02:32:24] <bblack>	 anyways, I assume with the depool we're stable, call/text if not
[02:33:04] <mutante>	 yes, thanks
[02:34:54] <icinga-wm>	 PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:42:43] <icinga-wm>	 RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:46:26] <icinga-wm>	 PROBLEM - MD RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:55] <icinga-wm>	 PROBLEM - configured eth on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:16] <icinga-wm>	 PROBLEM - DPKG on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:52] <icinga-wm>	 PROBLEM - SSH on ms-be1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:48:13] <icinga-wm>	 PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:48:46] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:25] <icinga-wm>	 PROBLEM - swift-container-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:25] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:26] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:45] <icinga-wm>	 PROBLEM - dhclient process on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:57] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:16] <icinga-wm>	 PROBLEM - Disk space on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:16] <icinga-wm>	 PROBLEM - salt-minion processes on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:29] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:38] <icinga-wm>	 PROBLEM - swift-account-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:38] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:06] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:07] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:07] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:19] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:20] <logmsgbot>	 !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.17) (duration: 17m 59s)
[03:05:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:34] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep  1 03:12:34 UTC 2016 (duration 7m 14s)
[03:12:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:35:08] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Cloud.cfg tidying for precise and trusty images [puppet] - 10https://gerrit.wikimedia.org/r/307881 
[03:38:02] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Cloud.cfg tidying for precise and trusty images [puppet] - 10https://gerrit.wikimedia.org/r/307881 (owner: 10Andrew Bogott)
[04:03:44] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[04:06:15] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5031030 keys - replication_delay is 0
[04:12:52] <grrrit-wm>	 (03PS1) 10Legoktm: Properly support 'basePath' [software/service-checker] - 10https://gerrit.wikimedia.org/r/307910 
[04:12:54] <grrrit-wm>	 (03PS1) 10Legoktm: Fix documentation in README [software/service-checker] - 10https://gerrit.wikimedia.org/r/307911 
[04:13:47] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Properly support 'basePath' [software/service-checker] - 10https://gerrit.wikimedia.org/r/307910 (owner: 10Legoktm)
[04:16:05] <grrrit-wm>	 (03CR) 10Legoktm: [C: 04-1] Allow service-checker to read YAML-formatted specs (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: 10Mobrovac)
[04:23:34] <grrrit-wm>	 (03CR) 10Legoktm: "recheck" [software/service-checker] - 10https://gerrit.wikimedia.org/r/307910 (owner: 10Legoktm)
[05:00:00] <grrrit-wm>	 (03PS1) 10Yuvipanda: extdist: Fix typo in comments [puppet] - 10https://gerrit.wikimedia.org/r/307915 
[05:01:02] <grrrit-wm>	 (03PS2) 10Yuvipanda: extdist: Fix typo in comments [puppet] - 10https://gerrit.wikimedia.org/r/307915 
[05:01:07] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] extdist: Fix typo in comments [puppet] - 10https://gerrit.wikimedia.org/r/307915 (owner: 10Yuvipanda)
[05:09:45] <icinga-wm>	 PROBLEM - NTP on ms-be1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:20:35] <jynus>	 so db1047 is starting to reduce its replication lag after the alter table
[06:35:20] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (10jcrespo)
[06:51:24] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600753 (10jcrespo)
[06:52:03] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (10jcrespo) I've verified through @wikimedia email that his account is Marostegui.
[07:08:34] <wikibugs>	 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600760 (10MoritzMuehlenhoff) @Papaul, that setting fixed it, mw2148 could now be reimaged. During the reimaging we'll likely run into further servers with the same problem, so le...
[07:12:10] <wikibugs>	 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600761 (10MoritzMuehlenhoff) @Papaul: Do the servers need to be powered down to change the setting or can you change that for running hosts as well?
[07:16:46] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600762 (10jcrespo)
[07:22:15] <icinga-wm>	 PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[07:23:52] <wikibugs>	 06Operations, 10Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600771 (10jcrespo)
[07:26:23] <wikibugs>	 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Maps-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2513156 (10timautin) Hi,  I'm the developer of the app E-walk: https://play.google.com/store/apps/details?id=com.at.ewalk.free  I'd definitely be interested if we could u...
[07:29:20] <icinga-wm>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:29:44] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600805 (10jcrespo)
[07:31:11] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600812 (10Marostegui) cloak requested
[07:35:49] <icinga-wm>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 2 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:45:53] <moritzm>	 !log reimaging mw2116-2119 to jessie
[07:45:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:46:50] <wikibugs>	 06Operations, 10ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2593788 (10elukey) From the dmesg:  ``` [12690812.776141] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [12690812.782847] ata2.00: BMDMA stat 0x24 [12690812.786685] ata2.00: failed command: WRITE DMA [12690...
[07:47:02] <wikibugs>	 06Operations, 10ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2600835 (10elukey) p:05Triage>03High
[07:52:12] <wikibugs>	 06Operations, 10Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600845 (10Marostegui) I have read and signed https://phabricator.wikimedia.org/L3   "You signed this document on Thu, Sep 1, 09:51."
[07:57:29] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/307905 (owner: 10Dzahn)
[07:57:34] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: url_downloader: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/307905 (owner: 10Dzahn)
[07:57:40] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] url_downloader: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/307905 (owner: 10Dzahn)
[07:59:46] <icinga-wm>	 RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 16, down: 1, shutdown: 0
[08:04:34] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Remove decommissioned host rubidium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/307917 (https://phabricator.wikimedia.org/T118213) 
[08:06:05] <icinga-wm>	 PROBLEM - DPKG on mw2148 is CRITICAL: Connection refused by host
[08:08:46] <icinga-wm>	 RECOVERY - DPKG on mw2148 is OK: All packages OK
[08:09:42] <grrrit-wm>	 (03CR) 10Ladsgroup: [C: 031] "So it's okay to merge! One of ops should delete the old repo in tin after this merge I guess." [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup)
[08:14:37] <grrrit-wm>	 (03PS1) 10Elukey: Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) 
[08:19:01] <wikibugs>	 06Operations, 10Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2600771 (10mark) Approved, proceed.
[08:37:24] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600908 (10hashar) Erik mentioned `elastic1028` had puppet disabled and elasticsearch disabled at 18...
[08:40:15] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600910 (10hashar) From SAL:  **2016-08-31** ``` lang=irc 18:54 <gehel> shutting down elasticsearch...
[08:42:03] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600914 (10hashar) p:05Triage>03Normal
[08:48:26] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2600924 (10AlexKrauseTUD) @Cmjohnson As said on Monday, 29th of August: AlexKrauseTUD or is it another account w...
[08:48:51] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600925 (10hashar) ETCD config is at https://config-master.wikimedia.org/conftool/eqiad/search and...
[08:54:54] <grrrit-wm>	 (03PS1) 10Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 
[08:56:18] <grrrit-wm>	 (03CR) 10Hashar: "That is to have https://noc.wikimedia.org/ to refer to the conftool config files." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 (owner: 10Hashar)
[09:04:29] <godog>	 !log reboot ms-be1016, stuck and nothing on console
[09:04:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:06:08] <moritzm>	 !log installing postgres security updates on labsdb1004/1006/1007
[09:06:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:34] <icinga-wm>	 RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:07:56] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be1016 is OK: OK: nf_conntrack is 4 % full
[09:07:57] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:08:05] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:08:05] <icinga-wm>	 RECOVERY - Disk space on ms-be1016 is OK: DISK OK
[09:08:15] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 8.87, 2.64, 0.91
[09:08:15] <icinga-wm>	 RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient
[09:08:25] <icinga-wm>	 RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:08:35] <icinga-wm>	 RECOVERY - DPKG on ms-be1016 is OK: All packages OK
[09:08:37] <icinga-wm>	 RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[09:08:54] <grrrit-wm>	 (03PS1) 10Marostegui: Added Manuel Arostegui (marostegui) to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/307921 
[09:10:00] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Added Manuel Arostegui (marostegui) to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/307921 (owner: 10Marostegui)
[09:18:43] <grrrit-wm>	 (03PS1) 10Marostegui: Removed tab and added spaces [puppet] - 10https://gerrit.wikimedia.org/r/307922 
[09:20:58] <grrrit-wm>	 (03PS2) 10Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 
[09:21:06] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600972 (10Gehel) >>! In T144450#2600908, @hashar wrote: > Erik mentioned `elastic1028` had puppet...
[09:22:29] <grrrit-wm>	 (03PS3) 10Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 
[09:23:14] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, fixed capitalization/indentation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 (owner: 10Hashar)
[09:24:49] <moritzm>	 !log reimaging mw2163-2166 to jessie
[09:24:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:26:15] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "LGTM from a quick glance, did you run it through the compiler?" [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff)
[09:28:30] <grrrit-wm>	 (03Abandoned) 10Jcrespo: Added Manuel Arostegui (marostegui) to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/307921 (owner: 10Marostegui)
[09:29:36] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600975 (10hashar) Sorry I have been misleading.  That link https://config-master.wikimedia.org/pyb...
[09:35:08] <elukey>	 !log reimaging mw2167 -> mw2170 to Jessie
[09:35:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:22] <moritzm>	 !log reimaging mw2061-2064 to jessie
[09:37:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:38:31] <grrrit-wm>	 (03PS2) 10Jcrespo: Add Manuel Arostegui (marostegui) cluster access with ops rights [puppet] - 10https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: 10Marostegui)
[09:42:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:44:21] <wikibugs>	 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2600990 (10MoritzMuehlenhoff) @Papaul, @Cmjohnson : I guess you have some kind of checklist for racking new hardware? Could you add an item to check that new servers always have "...
[09:44:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:51:05] <moritzm>	 !log reimaging mw2200-2203 to jessie
[09:51:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:44] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2601046 (10fgiunchedi)
[09:55:35] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: 10Marostegui)
[09:56:49] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1044.eqiad.wmnet
[09:56:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:33] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1045.eqiad.wmnet
[09:57:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:40] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=inactive; selector: elastic1046.eqiad.wmnet
[09:57:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:33] <jynus>	 !log adding marostegui to wmf and ops on wikitech LDAP
[09:58:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:40] <wikibugs>	 06Operations, 10MediaWiki-Maintenance-scripts, 06Performance-Team, 10Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#2601048 (10fgiunchedi)
[10:03:36] <jynus>	 I am sorry, but every time I touch the LDAP, the procedure has changed :-/
[10:03:51] <jynus>	 good news is that it makes sense now
[10:10:48] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601067 (10jcrespo)
[10:11:09] <grrrit-wm>	 (03CR) 10Muehlenhoff: "Yes, PCC output is at http://puppet-compiler.wmflabs.org/3897/" [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff)
[10:13:24] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Add Manuel Arostegui (marostegui) cluster access with ops rights [puppet] - 10https://gerrit.wikimedia.org/r/307922 (https://phabricator.wikimedia.org/T144470) (owner: 10Marostegui)
[10:15:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on db1072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.63 seconds
[10:15:38] <mark>	 and so it begins ;p
[10:15:44] <jynus>	 db1072?
[10:16:02] <jynus>	 no, I think it is a slow slave only, the one where we ca
[10:16:12] <apergos>	 it is, and I'm only doing half of the stubs at a time
[10:16:13] <jynus>	 *n get false positives
[10:16:16] <apergos>	 so
[10:16:29] <jynus>	 do not assume yet it is the dumps
[10:16:30] <apergos>	 i.e. 14 jobs running
[10:17:12] <gehel>	 !log repooled elastic104[456] - T144450
[10:17:12] <apergos>	 just paranoid...
[10:17:13] <stashbot>	 T144450: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450
[10:17:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:17:34] <jynus>	 but I am not going to lie, it looks like it
[10:17:57] <apergos>	 but we have already done a month's, well two months', run with 14 stubs going and it's been fine
[10:18:05] <apergos>	 so it must be interaction with something else at the same time
[10:18:12] <jynus>	 it is ok, as I said
[10:18:37] <jynus>	 it is the vslow, dumps slave
[10:18:57] <jynus>	 sadly, it is not easy to make icinga aware of mediawiki configuration dynamically
[10:19:06] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2601086 (10Gehel) 05Open>03Resolved a:03Gehel Issue was related to PyBal not re-doing DNS res...
[10:19:13] <apergos>	 maybe it is worth trying to figure out exactly what combination of queries is responsible
[10:19:31] <jynus>	 so we will just ack it, hope we have a quick solution with no pages next month
[10:19:41] <jynus>	 apergos, sure
[10:19:59] <apergos>	 we know the stubs when they run later in the month don't cause this
[10:19:59] <jynus>	 it is just that we should not be too concerned about it
[10:20:02] <apergos>	 right
[10:20:14] <apergos>	 I just hate it because suppose something actually breaks
[10:20:22] <grrrit-wm>	 (03PS4) 10Hashar: noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 
[10:20:32] <apergos>	 and we won't see it because 'well there's always this problem at the beginning of the month'
[10:20:40] <jynus>	 apergos, we will have to figure out a way
[10:20:53] <hashar>	 godog: thank you for the noc.wm.o review.  I have sent another tiny indentation fix https://gerrit.wikimedia.org/r/#/c/307920/3..4/docroot/noc/index.html
[10:20:59] <apergos>	 well if you open a new ticket, add me to it please
[10:21:00] <apergos>	 sigh
[10:21:02] <hashar>	 godog: if that is a +1 for you I am going to deploy it
[10:21:21] <hashar>	 would save me from looking at outdated pybal files ]
[10:21:27] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 (owner: 10Hashar)
[10:21:34] <godog>	 hashar: yup, LGTM thanks!
[10:21:41] <akosiaris>	 hmm I just got the page... quite delayed
[10:21:54] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 (owner: 10Hashar)
[10:22:22] <grrrit-wm>	 (03Merged) 10jenkins-bot: noc: link to conftool and wikitech pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307920 (owner: 10Hashar)
[10:22:55] <jynus>	 apergos, I think there was some collation script running
[10:23:14] <jynus>	 it just happens that this server is very loaded and mediawiki doesn't wait for it
[10:23:15] <apergos>	 oh that's right, isn't this the uc change?
[10:23:45] <jynus>	 I just need the glue between pooling status/tags and icinga
[10:24:08] <jynus>	 which obviously the good way of doing it is etcd
[10:24:11] <logmsgbot>	 !log hashar@tin Synchronized docroot/noc/index.html: link to conftool and wikitech pages on https://noc.wikimedia.org/ (duration: 00m 47s)
[10:24:15] <jynus>	 but we are not yet there
[10:24:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:38] <jynus>	 I accept suggestions for a temporary patch
[10:25:25] <apergos>	 I actually don't mind seeing the false positives as long as we can tell pretty quickly that they are indeed ok to ignore
[10:25:31] <apergos>	 or not
[10:25:58] <apergos>	 I don't have good thoughts on that right now but I will keep it in the back of my mind
[10:26:08] <jynus>	 well, it would be fine, if it wasn't for the fact that it pages
[10:26:19] <apergos>	 yep
[10:26:41] <jynus>	 because I have no way yet to distinguish a "slow slave" from a "SPOF slave"
[10:27:10] <apergos>	 well in our case they are the same thing, sadly
[10:27:16] <apergos>	 for now
[10:27:34] <jynus>	 apergos, while I appreciate dumps slaves
[10:27:44] <jynus>	 these will not cause immediate issues to users
[10:27:46] <apergos>	 no
[10:27:52] <apergos>	 it's the rest of the stuff run over there though
[10:28:15] <jynus>	 no, precisely, we leave there only things like terbium jobs
[10:28:27] <apergos>	 I thought there was some job queue related stuff that ran on the vslows
[10:28:28] <apergos>	 no?
[10:28:46] <jynus>	 no, jobs are executed on all slaves, load balanced
[10:29:02] <apergos>	 ok, maybe it was just discussion about moving one type of job off
[10:29:05] <jynus>	 I only added a single job temporarily, as a desperate measure
[10:29:10] <apergos>	 ah that was it then
[10:29:11] <jynus>	 yes, I want that
[10:29:20] <jynus>	 but that will be on regular slaves
[10:30:02] <jynus>	 it's complicated; the roles were a good idea, but make things more complex
[10:30:11] <apergos>	 funny how that works
[10:30:25] <jynus>	 *funny that works
[10:30:37] <apergos>	 :-D
[10:31:06] <jynus>	 so I will check later if this is a 1-time thing because of the script
[10:31:42] <jynus>	 and try to think of a way for the selective icinga checks
[10:32:53] <apergos>	 I'm interested in the first half of that, for sure
[10:33:09] <apergos>	 I can probably dial back the number of pages per stub I request at once
[10:33:12] <apergos>	 if we need that
[10:33:20] <apergos>	 so each query would be shorter
[10:34:10] <jynus>	 I think it is combined with the bad cirrus queries
[10:34:21] <jynus>	 apergos, check https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1072&user=&schema=&qmode=eq&query=&hours=1
[10:35:01] <jynus>	 so, yes, I caused a page
[10:35:14] <jynus>	 but on the other side, I think I prevented a production outage
[10:35:34] <jynus>	 apergos, sorry you are affected by this, I had to choose the lesser of 2 evils
[10:35:47] <apergos>	 I don't mind the page
[10:36:01] <apergos>	 I just mind the idea that my dumps are dragging down the host in combination with other things
[10:36:06] <apergos>	 so that's what I really want to fix
[10:36:08] <jynus>	 and technically, dumps shouldn't be affected by lag
[10:36:35] <apergos>	 I'm looking at the tendril page but the cirrus queries all seem to be pretty short lived in comparison
[10:37:11] <apergos>	 this is one of those cases where I don't know enough to interpret based on the data 
[10:37:33] <jynus>	 yes, but very frequent 7 second queries will overload the server easily
[10:38:40] <icinga-wm>	 PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:38:40] <apergos>	 I guess the flaggedrevsstats must be some update job?
[10:39:19] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:39:19] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:39:19] <icinga-wm>	 RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[10:39:39] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:39:45] <godog>	 !log reboot ms-be1016, stuck again
[10:39:49] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:39:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:07] <grrrit-wm>	 (03PS1) 10Niharika29: Test PageAssessments on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) 
[10:40:09] <icinga-wm>	 RECOVERY - SSH on ms-be1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[10:40:09] <jynus>	 apergos, yes, probably a terbium maintenance cron
[10:40:38] <apergos>	 I thought those ran during the 15-20th. or maybe those were some other set of special pages
[10:40:39] <apergos>	 grrr
[10:40:50] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:41:00] <icinga-wm>	 RECOVERY - salt-minion processes on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:41:00] <icinga-wm>	 RECOVERY - configured eth on ms-be1016 is OK: OK - interfaces up
[10:41:20] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:41:33] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[10:41:39] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:42:41] <apergos>	 well the flaggedrevs script runs every two hours
[10:42:44] <apergos>	 presumably that's not it then
[10:43:43] <apergos>	 or I mean not a contributing factor.  I guess I'll look into making the stubs queries shorter 
[10:45:01] <icinga-wm>	 RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset -0.001034259796 secs
[10:49:59] <moritzm>	 !log installing libidn security updates
[10:50:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:50:23] <wikibugs>	 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Maps-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2513156 (10Pnorman) > bulk download (select an area of the map, a min and a max zoom levels, and downloading each tiles one after one)  Bulk downloading high zoom image tile...
[10:53:49] <icinga-wm>	 PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:00:13] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601152 (10jcrespo)
[11:00:14] <wikibugs>	 06Operations, 10Ops-Access-Requests: Grant Manuel root on WMF cluster - https://phabricator.wikimedia.org/T144470#2601149 (10jcrespo) 05Open>03Resolved a:03jcrespo Marostegui has sudo and has confirmed me he has access to the cluster and mysql.
[11:01:01] <icinga-wm>	 PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:01:21] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about      Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2601153 (10hashar) The spam in logstash is gone :]  Well done ops!
[11:01:42] <icinga-wm>	 PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:02:14] <wikibugs>	 06Operations, 10Ops-Access-Requests: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601154 (10jcrespo)
[11:04:54] <hashar>	 elukey: hello, was it you that proposed to push the zuul.Deb package to apt.wm.o ?
[11:05:43] <hashar>	 the .changes lacked the orig.tar.gz  checksum.  And looks like the latest version I have built is still missing it :(
[11:05:49] <hashar>	 eg https://people.wikimedia.org/~hashar/debs/zuul_2.5.0-8-gcbc7f62-wmf2precise1/zuul_2.5.0-8-gcbc7f62-wmf2precise1_amd64.changes
[11:09:16] <wikibugs>	 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2601171 (10MoritzMuehlenhoff) >>! In T140443#2599180, @Legoktm wrote: > Hmm, what about wikitech/silver which is still using PHP5 on trusty?  The old version...
[11:09:55] <elukey>	 I think moritzm mentioned it IIRC but I can help
[11:09:56] <elukey>	 :)
[11:10:27] <grrrit-wm>	 (03PS1) 10Jcrespo: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) 
[11:11:56] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 04-1] "Not yet on the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo)
[11:16:07] <hashar>	 there is something with passing -sa  apparently
[11:16:22] <hashar>	 but can't figure out how git buildpackage  does or does not pass it
[11:19:44] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: site: add prometheus node_exporter to more codfw machines [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) 
[11:22:14] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "Adding respective service owners for notification, we haven't seen any adverse impact on DB hosts in codfw/eqiad though." [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi)
[11:23:40] <grrrit-wm>	 (03PS2) 10Jcrespo: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) 
[11:26:01] <icinga-wm>	 RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[11:28:44] <grrrit-wm>	 (03PS1) 10Cmjohnson: Relocating elastic1028 from row D to row B. Changing DNS entries to matc:wqh [dns] - 10https://gerrit.wikimedia.org/r/307930 
[11:29:34] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Relocating elastic1028 from row D to row B. Changing DNS entries to matc:wqh [dns] - 10https://gerrit.wikimedia.org/r/307930 (owner: 10Cmjohnson)
[11:33:31] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Bump the ABI of our kernel package to 2 [debs/linux44] - 10https://gerrit.wikimedia.org/r/307931 
[11:34:10] <grrrit-wm>	 (03PS1) 10Gilles: Configure Thumbor to use statsd for metrics [puppet] - 10https://gerrit.wikimedia.org/r/307932 (https://phabricator.wikimedia.org/T144478) 
[11:36:00] <gehel>	 log depooling elastic102[1289] - T143685
[11:36:00] <stashbot>	 T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685
[11:38:25] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff)
[11:46:08] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] Configure Thumbor to use statsd for metrics [puppet] - 10https://gerrit.wikimedia.org/r/307932 (https://phabricator.wikimedia.org/T144478) (owner: 10Gilles)
[11:47:25] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032] Bump the ABI of our kernel package to 2 [debs/linux44] - 10https://gerrit.wikimedia.org/r/307931 (owner: 10Muehlenhoff)
[11:49:05] <grrrit-wm>	 (03CR) 10Daniel Kinzler: "Now that Icf71cdb7 is merged, is this still needed?" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE))
[11:50:21] <grrrit-wm>	 (03CR) 10Marostegui: [C: 032] icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo)
[11:50:26] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 
[11:51:01] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 
[11:51:07] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Partially Revert "Clarify string in weekly Phabricator Project email" [puppet] - 10https://gerrit.wikimedia.org/r/307935 (owner: 10Alexandros Kosiaris)
[11:51:26] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: site: add prometheus node_exporter to more codfw machines [puppet] - 10https://gerrit.wikimedia.org/r/307929 (https://phabricator.wikimedia.org/T140646) 
[11:53:46] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 031] icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo)
[11:56:17] <grrrit-wm>	 (03PS3) 10Marostegui: icinga: add Manuel Arostegui to icinga privileged accounts [puppet] - 10https://gerrit.wikimedia.org/r/307928 (https://phabricator.wikimedia.org/T144469) (owner: 10Jcrespo)
[12:04:58] <grrrit-wm>	 (03PS1) 10Cmjohnson: Relocating elastic1021 and 1022 to row C, corresponding dns changes [dns] - 10https://gerrit.wikimedia.org/r/307936 
[12:05:37] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Relocating elastic1021 and 1022 to row C, corresponding dns changes [dns] - 10https://gerrit.wikimedia.org/r/307936 (owner: 10Cmjohnson)
[12:12:32] <gehel>	 !log rolling restart of ferm on elasticsearch eqiad cluster to account for moved servers - T143685
[12:12:33] <stashbot>	 T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685
[12:12:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:24:54] <mafk>	 hello I need an admin at wikitechwiki
[12:24:58] <mafk>	 vandalism in progress
[12:26:22] <mafk>	 Dereckson: ping, you admin at wikitech?
[12:28:53] <grrrit-wm>	 (03CR) 10Hashar: [C: 031] "We have tested on beta cluster mira and deployment-tin instances (both Trusty) by:" [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff)
[12:32:03] <mafk>	 AaronSchulz: ping
[12:46:17] <grrrit-wm>	 (03PS4) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) 
[12:46:17] <AaronSchulz>	 mafk: hmm?
[12:46:39] <mafk>	 AaronSchulz: there's a vandal in wikitechwiki, could you please block it?
[12:47:02] <mafk>	 https://wikitech.wikimedia.org/wiki/Special:Contributions/Adam_Hilter
[12:47:40] <AaronSchulz>	 I don't nominally have admin there
[12:48:35] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel)
[12:48:46] <mafk>	 ah, well, okay - I thought you had
[12:48:51] <mafk>	 sorry
[12:49:52] <p858snake>	 https://wikitech.wikimedia.org/w/index.php?title=Special%3AListUsers&username=&group=sysop&limit=50
[12:50:01] <p858snake>	 legoktm is probably the easiest to poke
[12:50:25] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] elastic102[1289] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) (owner: 10Gehel)
[12:50:32] <grrrit-wm>	 (03PS2) 10Gehel: elastic102[1289] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) 
[12:51:23] <mafk>	 or WikiSysop :P
[12:52:30] <AaronSchulz>	 it doesn't seem urgent atm
[12:52:51] <mafk>	 nope
[12:52:55] <AaronSchulz>	 if it was I could op myself with SQL ;)
[12:53:00] <mafk>	 lol
[12:53:22] <mafk>	 userrights at meta doesn't work (as it should be fwiw)
[12:53:55] <logmsgbot>	 !log ema@palladium conftool action : set/pooled=yes; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=varnish-be'])
[12:54:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:00:04] <jouncebot>	 hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1300). Please do the needful.
[13:00:52] <zeljkof>	 hashar, Dereckson: no eu swat today, right? https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A001
[13:01:35] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1047.eqiad.wmnet
[13:01:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:06:21] <Dereckson>	 zeljkof: right, nothing to deploy
[13:07:10] <zeljkof>	 until next week then :)
[13:08:43] <grrrit-wm>	 (03PS5) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) 
[13:13:27] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032] keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff)
[13:13:41] <grrrit-wm>	 (03PS1) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 
[13:16:01] <elukey>	 !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw130[01].eqiad.wmnet
[13:16:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:18:20] <grrrit-wm>	 (03PS1) 10Cmjohnson: relocating elastic1029, corresponding dns change [dns] - 10https://gerrit.wikimedia.org/r/307944 
[13:18:45] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] relocating elastic1029, corresponding dns change [dns] - 10https://gerrit.wikimedia.org/r/307944 (owner: 10Cmjohnson)
[13:22:05] <elukey>	 ah snap these are jobrunners
[13:22:22] <icinga-wm>	 PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[13:24:10] <elukey>	 mm any reason why this is happening --^ ?
[13:24:40] <grrrit-wm>	 (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[13:25:55] <moritzm>	 elukey: you mean the keyholder? that's fixed and a result of my patch https://gerrit.wikimedia.org/r/307510
[13:26:12] <moritzm>	 after a reboot/restart of the keyholder service, the keys need to be readded
[13:26:35] <elukey>	 super missed your patch
[13:26:43] <elukey>	 I was wondering if it was on purpose or not
[13:26:44] <elukey>	 thanks :)
[13:27:23] <icinga-wm>	 RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[13:29:48] <elukey>	 moritzm: in order to "depool" a jobrunner I'd need to stop jobchron right? conftool does not help a lot in this use case
[13:30:15] <logmsgbot>	 !log hashar@tin Synchronized php-1.28.0-wmf.17/includes/page/WikiPage.php: T144484 (duration: 00m 35s)
[13:30:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:30:43] <hashar>	 13:30:15 336 apaches had sync errors
[13:31:17] <elukey>	 wow
[13:31:57] <hashar>	 yeah that is the keyholder that got stopped just when I have hit scap sync-file
[13:32:13] <hashar>	 moritzm is moving it to base::service_unit and rearming the keys
[13:32:21] <elukey>	 super luck
[13:32:22] <elukey>	 :)
[13:33:59] <moritzm>	 hashar: now rearmed
[13:34:22] <hashar>	 moritzm: all good on tin.eqiad.wmnet. Well done!
[13:34:23] <logmsgbot>	 !log hashar@tin Synchronized php-1.28.0-wmf.17/includes/page/WikiPage.php: T144484 (duration: 00m 49s)
[13:34:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:34:27] <moritzm>	 elukey: I think so, I've simply always upgraded jobrunners in small batches
[13:34:31] <hashar>	 that validates the transition to base::service_unit
[13:34:33] <grrrit-wm>	 (03CR) 10Ottomata: [C: 031] Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey)
[13:36:36] <elukey>	 mmmm doesn't seem to stop the POST to hhvm
[13:37:34] <elukey>	 ah no that's the jobrunner's job
[13:37:44] <elukey>	 It is a separate entity
[13:39:14] <elukey>	 !log not upgrading mw130[01] since I'd need more info before proceeding 
[13:39:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:39:27] <elukey>	 I need to study them a bit more
[13:39:32] <elukey>	 will proceed with API only
[13:41:16] <bblack>	 !log uploaded openssl-1.1.0-1+wmf1 to jessie-wikimedia/experimental
[13:41:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:42:13] <elukey>	 !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw128[78]
[13:42:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:50:04] <grrrit-wm>	 (03PS2) 10Elukey: Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) 
[13:51:12] <grrrit-wm>	 (03CR) 10Elukey: [C: 032] Remove g+w flag to /srv/log/eventlogging to allow proper logrotation [puppet] - 10https://gerrit.wikimedia.org/r/307918 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey)
[14:00:56] <grrrit-wm>	 (03PS11) 10Thiemo Mättig (WMDE): Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) 
[14:02:58] <grrrit-wm>	 (03CR) 10Yurik: [C: 031] maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[14:03:57] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] Rewrite outdated comment about Gerrit-Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE))
[14:04:59] <grrrit-wm>	 (03PS2) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 
[14:11:30] <godog>	 !log wipe and reinitialize corrupted xfs on /dev/sdn1 on ms-be1016
[14:11:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:32] <wikibugs>	 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2601432 (10Papaul) @MoritzMuehlenhoff I think we need to power down the hosts. For also to be on the safe side to make sure that the settings are saved a applied by the hosts I re...
[14:25:22] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: package::builder: EXPERIMENTAL=yes support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307943 (owner: 10BBlack)
[14:27:28] <mobrovac>	 !log restbase deploy start of 9cca320
[14:27:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:34:27] <icinga-wm>	 RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:34:56] <wikibugs>	 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601459 (10jcrespo)
[14:37:55] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "If I understand correctly what would be stored either on swift or local FS would be math renderings, correct? IOW if we lose silver's FS t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson)
[14:39:16] <elukey>	 !log upgrading httpd/apache to 2.4.10-10+deb8u6+wmf2 on mw128[56]
[14:39:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:39:34] <wikibugs>	 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601480 (10jcrespo) Debugging on IRC: ``` <akosiaris> so this time around it's /etc/passwd that's locked <akosiaris> not /etc/passwd+ <akosiaris> https://github.com/netblue30/firejail/issues/559 <akosiaris> ? <ako...
[14:40:43] <moritzm>	 !log powered down several hosts for hardware maintenance (T142726): mw2087, mw2149-mw2151
[14:40:44] <stashbot>	 T142726: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726
[14:40:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:42:25] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492
[14:42:25] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492
[14:42:25] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on sca2002 is CRITICAL: CRITICAL: Puppet has 1 failures Jcrespo Firejail/kernel issue: https://phabricator.wikimedia.org/T144492
[14:42:34] <mobrovac>	 !log restbase deploy end of 9cca320
[14:42:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:42:39] <icinga-wm>	 RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[14:42:55] <jynus>	 oh?
[14:43:37] <jynus>	 did someone workaround it?
[14:45:30] <grrrit-wm>	 (03PS3) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) 
[14:47:10] <grrrit-wm>	 (03PS3) 10Dereckson: Enable Math on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) 
[14:47:28] <icinga-wm>	 RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[14:47:49] <grrrit-wm>	 (03CR) 10Dereckson: "PS3: switched Math file storage to local-backend." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson)
[14:52:08] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601512 (10jcrespo)
[14:52:25] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2600739 (10jcrespo)
[14:53:58] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[14:55:08] <icinga-wm>	 PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[14:57:38] <icinga-wm>	 RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[14:57:42] <grrrit-wm>	 (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[14:58:33] <moritzm>	 !log powered down several hosts for hardware maintenance (T142726): mw2099, mw2102, mw2117, mw2163-mw2199
[14:58:34] <stashbot>	 T142726: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726
[14:58:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:59:01] <wikibugs>	 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2601552 (10MoritzMuehlenhoff) Hi Papaul, first batch: These are depooled from the cluster and powered down:  mw2087 mw2099 mw2102 mw2117 mw2149-mw2151 mw2163-mw2199
[15:03:53] <grrrit-wm>	 (03PS1) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) 
[15:04:27] <grrrit-wm>	 (03CR) 10Anomie: Do not use $wgExtensionFunctions to set globals (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza)
[15:05:15] <grrrit-wm>	 (03PS4) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[15:05:41] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM from swift/production perspective" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson)
[15:06:37] <grrrit-wm>	 (03PS2) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) 
[15:07:39] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] "Yep, worth a try!" [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush)
[15:08:59] <grrrit-wm>	 (03PS3) 10BBlack: package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 
[15:09:18] <grrrit-wm>	 (03PS1) 10Dzahn: point wikipedia.in to 180.179.52.130 [dns] - 10https://gerrit.wikimedia.org/r/307959 
[15:09:35] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: maps - use new tileshell.js script to notify tilerator of expired tiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[15:11:00] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] package::builder: EXPERIMENTAL=yes support [puppet] - 10https://gerrit.wikimedia.org/r/307943 (owner: 10BBlack)
[15:11:27] <grrrit-wm>	 (03Abandoned) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) (owner: 10Mobrovac)
[15:11:28] <wikibugs>	 06Operations: Blocked /etc/passwd on sca100[1234] hosts - https://phabricator.wikimedia.org/T144492#2601459 (10akosiaris) I 've done a   ``` service zotero stop puppet agent -t -v service zotero stop  /usr/bin/gpasswd ops -M filippo,jgreen,bblack,andrew,faidon,rush,oblivian,laner,yuvipanda,dzahn,akosiaris,spring...
[15:11:29] <grrrit-wm>	 (03CR) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[15:11:59] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2601629 (10jcrespo) 05Open>03Resolved a:03jcrespo So, aside from the subtask  T144496, most tasks were completed and checked.  There are some small hanging issues, lik...
[15:13:00] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott)
[15:13:05] <grrrit-wm>	 (03PS5) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[15:17:26] <grrrit-wm>	 (03PS4) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) 
[15:18:34] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel)
[15:19:28] <grrrit-wm>	 (03PS6) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[15:23:17] <icinga-wm>	 PROBLEM - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100%
[15:26:19] <grrrit-wm>	 (03PS5) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) 
[15:30:43] <jynus>	 andrewbogott, when you connect, lots of cronspam from labtestservices2001:/etc/dns-floating-ip-updater.py
[15:30:58] <jynus>	 I've already reported the one from stats1002
[15:31:08] <andrewbogott>	 jynus: ok, that's probably a Krenair issue
[15:31:16] <jynus>	 oh, sorry
[15:31:23] <jynus>	 I just assumed it was you
[15:31:28] <Krenair>	 okay, what does it say jynus?
[15:32:27] <andrewbogott>	 https://www.irccloud.com/pastebin/3RXsKQjG/
[15:32:31] <andrewbogott>	 Krenair: ^
[15:32:47] <andrewbogott>	 (but I haven't actually thought about it yet)
[15:33:46] <grrrit-wm>	 (03PS6) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) 
[15:34:56] <chasemp>	 !log reboot nova-compute on labvirt1013 as stuck (no logs, not applying any changes or taking any instruction)
[15:35:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:35:10] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[15:36:17] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic102[12].eqiad.wmnet
[15:36:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:37:52] <wikibugs>	 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org, 07LDAP: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2601696 (10Sophivorus) Thanks! So @chasemp or @Andrew told you it's not ok to change a user's UID in LDAP?
[15:38:42] <Krenair>	 andrewbogott, jynus: apparently it fails at this project: createprojecttest10
[15:38:51] <Krenair>	 andrewbogott, jynus: wonder why novaadmin can't list servers in that project
[15:39:02] <Krenair>	 <andrewbogott> Krenair: I can just delete the project :)
[15:39:02] <Krenair>	 <andrewbogott> Probably novaadmin isn't a member
[15:39:02] <Krenair>	 <andrewbogott> although I don't know why that would've changed yesterday
[15:39:46] <Krenair>	 As far as I'm concerned, a project that novaadmin can't administrate is a broken project
[15:39:48] <wikibugs>	 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2601719 (10Papaul) 05Open>03Resolved a:05Papaul>03akosiaris This server has basic support which doesn't cover SATA disk replacement. I used one of the spare disks on-site for replacement.
[15:39:49] <Krenair>	 Nothing wrong with my script
[15:40:07] <andrewbogott>	 Krenair: I agree
[15:40:07] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[15:40:09] <andrewbogott>	 let me check
[15:40:10] <jynus>	 Krenair, nothing assumed your script was wrong
[15:40:33] <jynus>	 although maybe if it is a cron it shouldn't need to mail to root the errors?
[15:40:43] <Krenair>	 No I know, I'm just saying - this particular issue is not simply my thing
[15:41:20] <jynus>	 my Engrish is getting worse and worse
[15:41:27] <Krenair>	 even though issues with this script could certainly be my thing
[15:42:58] <icinga-wm>	 PROBLEM - Disk space on ms-be1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=82%)
[15:43:35] <andrewbogott>	 Krenair: novaadmin was a user but not a projectadmin.  So I added the projectadmin role
[15:43:51] <Krenair>	 interesting
[15:44:11] <andrewbogott>	 I'm not clear on if that means that /all/ the spam will be fixed or just that one that I pasted though
[15:44:14] <Krenair>	 andrewbogott, oh.
[15:44:16] <Krenair>	 so you fixed one
[15:44:22] <Krenair>	 now it fails on the next: createprojecttest11
[15:44:52] <Krenair>	 I actually don't know what the procedure is to create projects, or if there's some script
[15:45:10] <icinga-wm>	 RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 173 seconds ago with 0 failures
[15:45:21] <andrewbogott>	 I'm going to look for projects where novaadmin only has one role vs. two and delete those projects
[15:45:24] <godog>	 I'll take a look at ms-be1004 shortly
[15:45:36] <andrewbogott>	 probably I made those from the commandline and they are (unsurprisingly) broken as a result
[15:46:58] <jynus>	 finally s3 schema change arrives to production hosts
[15:48:47] <jynus>	 next time someone proposes "INSERT SELECT" cannot be that slow, I will show them the times for a schema change on pagelinks
[15:51:13] <andrewbogott>	 jynus, Krenair, I can't promise that I got all of them but I just deleted many broken projects.  We'll see if we still get any of those cron notices.
[15:51:20] <grrrit-wm>	 (03CR) 10Rush: [C: 032] openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush)
[15:51:25] <grrrit-wm>	 (03PS3) 10Rush: openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) 
[15:52:21] <wikibugs>	 06Operations, 10ops-eqiad: rack/setup/deploy puppetmaster100[12] - https://phabricator.wikimedia.org/T143219#2601772 (10Cmjohnson)
[15:52:21] <Krenair>	 andrewbogott, testcreation4 is now the failing project
[15:52:33] <wikibugs>	 06Operations, 10ops-eqiad: ms-be1004.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T144499#2601773 (10fgiunchedi) 03NEW
[15:52:36] <grrrit-wm>	 (03CR) 10Rush: [V: 032] openstack: tune sqlalchemy pooling behavior [puppet] - 10https://gerrit.wikimedia.org/r/307956 (https://phabricator.wikimedia.org/T144339) (owner: 10Rush)
[15:52:51] <andrewbogott>	 Krenair: that one should be gone too...
[15:53:18] <icinga-wm>	 RECOVERY - Disk space on ms-be1004 is OK: DISK OK
[15:53:50] <wikibugs>	 06Operations, 10ops-eqiad: ms-be1004.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T144499#2601784 (10fgiunchedi) note that the disk is reported as ok by the raid controller, linux however encounters errors while using it
[15:55:30] <grrrit-wm>	 (03PS2) 10Mobrovac: Allow service-checker to read YAML-formatted specs [software/service-checker] - 10https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) 
[15:55:39] <icinga-wm>	 PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[15:57:03] <grrrit-wm>	 (03CR) 10Mobrovac: Allow service-checker to read YAML-formatted specs (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: 10Mobrovac)
[15:58:07] <icinga-wm>	 RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[15:59:01] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] Properly support 'basePath' [software/service-checker] - 10https://gerrit.wikimedia.org/r/307910 (owner: 10Legoktm)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1600). Please do the needful.
[16:00:04] <jouncebot>	 mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:42] <godog>	 mobrovac: looking at the patches
[16:00:44] * mobrovac is here
[16:00:46] <mobrovac>	 kk godog
[16:02:49] <grrrit-wm>	 (03PS1) 10Ema: Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) 
[16:03:35] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: Change-Prop: Ignore non-main NS titles for Wikidata updates [puppet] - 10https://gerrit.wikimedia.org/r/307791 (owner: 10Mobrovac)
[16:03:43] <mobrovac>	 godog: all patches can go at once, fyi
[16:05:03] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] Change-Prop: Ignore non-main NS titles for Wikidata updates [puppet] - 10https://gerrit.wikimedia.org/r/307791 (owner: 10Mobrovac)
[16:05:09] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[16:05:59] <godog>	 mobrovac: ack, I'll follow the same order as Deployments
[16:06:16] <grrrit-wm>	 (03PS2) 10Ema: Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) 
[16:06:25] <mobrovac>	 kk
[16:06:26] <grrrit-wm>	 (03CR) 10Ema: [C: 032 V: 032] Revert "Upgrade upload ulsfo to Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/307964 (https://phabricator.wikimedia.org/T144257) (owner: 10Ema)
[16:06:52] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: Change-Prop: Removed unused config properties [puppet] - 10https://gerrit.wikimedia.org/r/306300 (owner: 10Ppchelko)
[16:08:24] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] Change-Prop: Removed unused config properties [puppet] - 10https://gerrit.wikimedia.org/r/306300 (owner: 10Ppchelko)
[16:08:55] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: Ignore 503 from ORES updates. [puppet] - 10https://gerrit.wikimedia.org/r/307810 (owner: 10Ppchelko)
[16:09:59] <ema>	 !log downgrading cp4006 to varnish 3 T131502
[16:10:00] <stashbot>	 T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502
[16:10:02] <AndyRussG>	 Any ideas on how to check if there was a problem with our ObjectCache infrastructure between 13:00 and 13:40 UTC yesterday? Or ideas about whom to ask?
[16:10:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:10:09] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[16:10:16] <grrrit-wm>	 (03CR) 10Addshore: "Sorry for a lack of a response (I have been on vacation)." [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore)
[16:11:14] <grrrit-wm>	 (03PS1) 10Jdlrobson: Enable Wikidata descriptions taglines on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) 
[16:12:08] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] Ignore 503 from ORES updates. [puppet] - 10https://gerrit.wikimedia.org/r/307810 (owner: 10Ppchelko)
[16:12:16] <grrrit-wm>	 (03PS4) 10Andrew Bogott: Add new ssh key for Addshore [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore)
[16:12:33] <godog>	 mobrovac: all merged, can you check?
[16:12:58] <mobrovac>	 k godog, running puppet and restarting
[16:14:09] <grrrit-wm>	 (03PS1) 10Jdlrobson: Do not show Wikidata descriptions on meta or mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307968 
[16:14:11] <grrrit-wm>	 (03PS1) 10Jdlrobson: Enable on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307969 (https://phabricator.wikimedia.org/T143345) 
[16:14:44] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Add new ssh key for Addshore [puppet] - 10https://gerrit.wikimedia.org/r/303863 (owner: 10Addshore)
[16:14:48] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Do not show Wikidata descriptions on meta or mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307968 (owner: 10Jdlrobson)
[16:14:51] <addshore>	 thanks andrewbogott !
[16:15:02] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Enable on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307969 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson)
[16:15:12] <icinga-wm>	 RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 178 seconds ago with 0 failures
[16:15:26] <andrewbogott>	 addshore: it'll take ~30 minutes before the key is active everywhere
[16:16:05] <addshore>	 Okay!
[16:17:06] <mobrovac>	 godog: looking good in codfw, proceeding with eqiad
[16:19:29] <godog>	 mobrovac: https://media.giphy.com/media/xKy2w6LehxxHa/giphy.gif
[16:19:45] <mobrovac>	 hahahahaha
[16:21:16] <grrrit-wm>	 (03PS2) 10Jdlrobson: End lazy loading reference experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307435 (https://phabricator.wikimedia.org/T144240) 
[16:21:32] <mobrovac>	 godog: looking good everywhere
[16:21:34] <mobrovac>	 godog: thnx!
[16:21:49] <godog>	 mobrovac: np!
[16:24:06] <ema>	 !log restarting pybal on lvs4002 T134893
[16:24:07] <stashbot>	 T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893
[16:24:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:25:13] <icinga-wm>	 PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[16:26:35] <wikibugs>	 06Operations, 10Beta-Cluster-Infrastructure, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2601885 (10fgiunchedi)
[16:30:12] <icinga-wm>	 PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[16:30:23] <grrrit-wm>	 (03PS1) 10BBlack: nginx-1.11.3 + openssl-1.1.0 temp hacky build [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/307973 
[16:32:43] <Krenair>	 andrewbogott, jynus: I think the issue is resolved now
[16:34:05] <AndyRussG>	 bblack: sorry for the bother, whom or where could I consult to find out if there were issues with our object cache infrastructure (if that ever happens?) yesterday 13:00 - 13:45 UTC? thx in advance!
[16:34:57] <bblack>	 AndyRussG: what is our object cache infrastructure?
[16:35:01] <bblack>	 (in this question)
[16:35:13] <icinga-wm>	 PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[16:35:14] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[16:35:47] <AndyRussG>	 bblack: whatever is behind MediaWiki's ObjectCache::getMainWANInstance()
[16:35:56] <wikibugs>	 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601909 (10Aklapper)
[16:37:01] <bblack>	 I'm not sure what that is tbh
[16:37:07] <wikibugs>	 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601603 (10Aklapper) Regarding WMF-NDA membership (such requests go to #WMF-NDA-Requests):  I've added @Marostegui to #WMF-NDA. Steps I perfo...
[16:37:25] <AndyRussG>	 bblack: maybe memcached?
[16:37:39] <grrrit-wm>	 (03PS7) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[16:37:44] <wikibugs>	 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2601943 (10Aklapper) Regarding access to tasks in private S4: S4 was created in T93760 and that task does not document who can edit the membe...
[16:38:00] <bblack>	 AndyRussG: probably
[16:38:10] <AndyRussG>	 I'm looking at an unexplained bug where this code is suspected of behaving oddly: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/f6abdb6be74bb8620b19918dbff5c8d617f053d9/includes/ChoiceDataProvider.php#L36-L52
[16:38:49] <AndyRussG>	 It could be a bug or maybe also an outage, but I don't know enuf about how that works behind the scenes to know if that makes sense...
[16:38:56] <AndyRussG>	 (https://phabricator.wikimedia.org/T144393)
[16:39:42] <AndyRussG>	 This is as far as I got wrt doc: https://wikitech.wikimedia.org/wiki/Memcached
[16:40:14] <icinga-wm>	 RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[16:40:14] <icinga-wm>	 PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[16:40:24] <AndyRussG>	 Maybe I should look for something on ganglia
[16:42:31] <bblack>	 AndyRussG: there doesn't seem to be any big obvious event in basic system stats for eqiad memcached then
[16:43:46] <AndyRussG>	 bblack: K thanks... Yeah I should have remembered Ganglia!! That's the right place to look 4 next time? https://ganglia.wikimedia.org/latest/?r=week&cs=08%2F30%2F2016+00%3A00&ce=09%2F01%2F2016+00%3A00&c=Memcached+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[16:44:10] <bblack>	 AndyRussG: yeah
[16:44:45] <AndyRussG>	 Oh wait there was something on codfw
[16:44:56] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1028.eqiad.wmnet
[16:45:00] * Jeff_Green fixing boron...^^^
[16:45:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:45:15] <icinga-wm>	 RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 169 seconds ago with 0 failures
[16:45:22] <grrrit-wm>	 (03PS3) 10Andrew Bogott: Don't attempt to set root user password [labs/private] - 10https://gerrit.wikimedia.org/r/304321 (owner: 10Yuvipanda)
[16:46:19] <AndyRussG>	 though not at the same time
[16:47:04] <AndyRussG>	 https://ganglia.wikimedia.org/latest/?r=custom&cs=08%2F31%2F2016+00%3A00&ce=08%2F31%2F2016+23%3A59&c=Memcached+codfw&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[16:47:11] <AndyRussG>	 bblack: ^
[16:48:16] <grrrit-wm>	 (03PS13) 10Gehel: elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 
[16:48:43] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032 V: 032] Don't attempt to set root user password [labs/private] - 10https://gerrit.wikimedia.org/r/304321 (owner: 10Yuvipanda)
[16:49:42] <ema>	 !log cp4006 repooled after downgrade
[16:49:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:50:15] <bblack>	 AndyRussG: yeah that's showing some event for codfw memcache ~05:30 on the 31st... but codfw memcache/mw isn't used for real traffic at present, right?
[16:50:42] <AndyRussG>	 I have no idea 8p
[16:50:57] <bblack>	 hmm my time was off looking at a zoomed-out view
[16:51:08] <bblack>	 it's more like 03:42 -> 06:30
[16:51:19] <AndyRussG>	 still doesn't match the time of the bug I have (13:00 - 13:40)
[16:51:22] <bd808>	 jouncebot: refresh
[16:51:25] <jouncebot>	 I refreshed my knowledge about deployments.
[16:51:41] <bblack>	 AndyRussG: is that time definitely UTC and not mis-tz-translated somewhere?
[16:51:45] <AndyRussG>	 Maybe there was some maintenance that could have corrupted stuff, even though no issues show up in these graphs?
[16:51:55] <bblack>	 it's possible
[16:52:04] <grrrit-wm>	 (03PS8) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[16:52:04] <AndyRussG>	 definitely UTC
[16:52:17] <bblack>	 what kind of issue happened? I thought the memcache stuff was mostly a true cache, and thus missing things aren't horribly broken
[16:52:52] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 (owner: 10Gehel)
[16:53:06] <AndyRussG>	 Changes to the settings in the CentralNotice tables weren't reflected in banners/campaigns being served
[16:53:22] <bblack>	 the mysql tables?
[16:53:27] <AndyRussG>	 yeah
[16:53:34] <AndyRussG>	 The CN DB data seems to have been all good (I checked the CN log tables)
[16:53:41] <AndyRussG>	 Also checked about replication issues
[16:53:48] <bblack>	 I'm guessing those happen through some API that takes care of memcache, not as a direct mysql data mod?
[16:54:28] <AndyRussG>	 The admin changes go directly to MySQL, but yeah, the info sent to clients about what campaigns go where goes through memcache
[16:54:35] <AndyRussG>	 ObjectCache
[16:55:04] <AndyRussG>	 Also maybe worthwhile to note that the people making the changes were in Europe? (or at least one of them was)
[16:55:27] <AndyRussG>	 Supposedly the cache is purged every time there is a CN campaign settings change
[16:55:31] <bblack>	 so if the admin changes go direct to mysql, what invalidates the now-stale object cache entries?  I don't think a direct mysql update would.
[16:55:38] <bblack>	 hmmm ok
[16:55:44] <AndyRussG>	 No there's specific code that does that
[16:55:57] <AndyRussG>	 (which could also have a bug... that'd be the other theory)
[16:56:43] <AndyRussG>	 https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/f6abdb6be74bb8620b19918dbff5c8d617f053d9/includes/ChoiceDataProvider.php#L18-L22
[16:57:10] <AndyRussG>	 Things started to work after admin users jiggled some unrelated settings
[16:57:56] <AndyRussG>	 Which makes a caching issue on some level seem a feasible explanation
[16:58:26] <icinga-wm>	 PROBLEM - Host mw2199 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:26] <icinga-wm>	 PROBLEM - Host mw2193 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2191 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2192 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2198 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2195 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2197 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:37] <icinga-wm>	 PROBLEM - Host mw2194 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:47] <icinga-wm>	 PROBLEM - Host mw2190 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:47] <icinga-wm>	 PROBLEM - Host mw2196 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:57] <AndyRussG>	 Yeah it's nothing horrible if the cache data is missing, but it can be bad if wrong cache data is being served
[16:59:12] <moritzm>	 ^fixed, downtime too short
[16:59:17] <icinga-wm>	 PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: puppet fail
[16:59:58] <gehel>	 ^ elastic2006 is me, checking
[17:00:05] <jouncebot>	 yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1700). Please do the needful.
[17:00:05] <jouncebot>	 bd808: A patch you scheduled for Services – Graphoid / Parsoid / OCG / Citoid / ORES is about to be deployed. Please be available during the process.
[17:00:15] <yurik>	 nope
[17:00:20] <subbu>	 no parsoid deploy
[17:00:26] <halfak>	 no ORES
[17:00:51] <Niharika>	 Yeah, it's time for my patch! :)
[17:01:50] <grrrit-wm>	 (03PS9) 10Andrew Bogott: Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) 
[17:02:00] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "yea, it's just a comment line now, of course. but dont wanna restart gerrit service for this, which will happen after merge. so whenever t" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE))
[17:02:01] <icinga-wm>	 RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:03:09] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Set up a root password for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott)
[17:03:10] <bd808>	 I've got a striker deploy to do
[17:03:36] <grrrit-wm>	 (03PS1) 10BBlack: text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) 
[17:04:49] <Niharika>	 Oops, my patch is in an hour.
[17:05:57] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack)
[17:06:02] <grrrit-wm>	 (03PS2) 10BBlack: text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) 
[17:06:07] <grrrit-wm>	 (03CR) 10BBlack: [V: 032] text VCL: bad browser redirect: target IE8/XP more [puppet] - 10https://gerrit.wikimedia.org/r/307977 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack)
[17:07:39] <wikibugs>	 06Operations: Instead of url forward for wikipedia.in ... add server ip. - https://phabricator.wikimedia.org/T144508#2602033 (10Naveenpf)
[17:08:14] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Define labs_puppet_master for labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/307980 (https://phabricator.wikimedia.org/T142531) 
[17:08:17] <icinga-wm>	 ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors Gehel hostgroup missing after the rename of the relforge cluster - gehel
[17:09:36] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Define labs_puppet_master for labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/307980 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott)
[17:10:03] <grrrit-wm>	 (03PS1) 10MaxSem: Deploy Kartographer everywhere public [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) 
[17:10:20] <grrrit-wm>	 (03PS1) 10Cmjohnson: Adding new mgmt dns entries for 3 new fundraising hosts, pay-lvs's frauth1001 [dns] - 10https://gerrit.wikimedia.org/r/307982 
[17:10:54] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Adding new mgmt dns entries for 3 new fundraising hosts, pay-lvs's frauth1001 [dns] - 10https://gerrit.wikimedia.org/r/307982 (owner: 10Cmjohnson)
[17:12:09] <bd808>	 !log Updated striker to ac555bd; fixes T144064
[17:12:10] <stashbot>	 T144064: Tool Maintainers badly overcounted - https://phabricator.wikimedia.org/T144064
[17:12:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:14:50] <AndyRussG>	 bblack: sorry 4 the bother still, who might be directly responsible for memcached infrastructure? Or whom should I ask to find out? thx again!!!
[17:15:31] <grrrit-wm>	 (03PS1) 10BBlack: openssl (1.1.0-1+wmf2) experimental; urgency=medium [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307983 
[17:16:38] <bblack>	 AndyRussG: well we all are in some sense, for the infra, but I don't understand our mc to that level, and I'm not sure who does.  I would look first to tracking down where things went wrong with purging the stale data before looking for an infra outage.
[17:17:04] <bblack>	 AndyRussG: maybe _joe_ too, but he's on vacation
[17:17:17] <bblack>	 AndyRussG: can the data simply be purged again?
[17:18:02] <AndyRussG>	 bblack: K... Yeah it can be purged as much as we like
[17:18:28] <AndyRussG>	 In the code itself I don't see how it was possible that the cache wasn't purged
[17:18:56] <AndyRussG>	 But someone who knows the infrastructure well might be able to shed some light
[17:19:07] <AndyRussG>	 Maybe it was only purged in one datacenter?
[17:20:46] <grrrit-wm>	 (03PS1) 10Gehel: Create the relforge hostgroups now that relforge cluster has been renamed. [puppet] - 10https://gerrit.wikimedia.org/r/307985 
[17:22:18] <icinga-wm>	 PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100%
[17:22:22] <AndyRussG>	 np waiting until Monday to pester _joe_ I think (unless this starts to happen again :) )
[17:24:00] <grrrit-wm>	 (03PS2) 10Gehel: Create the relforge hostgroups now that relforge cluster has been renamed. [puppet] - 10https://gerrit.wikimedia.org/r/307985 
[17:24:10] <icinga-wm>	 RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[17:25:29] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Create the relforge hostgroups now that relforge cluster has been renamed. [puppet] - 10https://gerrit.wikimedia.org/r/307985 (owner: 10Gehel)
[17:27:44] <icinga-wm>	 PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[17:29:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 27.01 seconds
[17:30:40] <grrrit-wm>	 (03PS3) 10Dzahn: Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder)
[17:31:31] <grrrit-wm>	 (03PS1) 10Yuvipanda: Temp. Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 
[17:32:11] <wikibugs>	 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2602170 (10Cmjohnson) Received the new disks from Dell, installed them set to RAID and the new disks are now handled by the BIOS.
[17:32:28] <icinga-wm>	 RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[17:33:01] <grrrit-wm>	 (03CR) 10Rush: [C: 031] Temp. Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 (owner: 10Yuvipanda)
[17:33:42] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] Temp. Hack to get tools up and running [labs/private] - 10https://gerrit.wikimedia.org/r/307989 (owner: 10Yuvipanda)
[17:33:44] <grrrit-wm>	 (03CR) 10Dzahn: "We are going to add the domain to our zones so that it exists properly but linked to the "parking" template so no traffic until the redire" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder)
[17:34:12] <grrrit-wm>	 (03PS4) 10Dzahn: Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder)
[17:34:27] <grrrit-wm>	 (03PS1) 10Andrew Bogott: vmbuilder: Upgrade openssh on firstboot [puppet] - 10https://gerrit.wikimedia.org/r/307991 
[17:38:00] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Add wikiquote.pl, link to parking [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder)
[17:39:35] <grrrit-wm>	 (03PS1) 10Jgreen: flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 
[17:40:27] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] vmbuilder: Upgrade openssh on firstboot [puppet] - 10https://gerrit.wikimedia.org/r/307991 (owner: 10Andrew Bogott)
[17:44:17] <icinga-wm>	 RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 111, down: 0, dormant: 0, excluded: 2, unused: 0
[17:44:17] <grrrit-wm>	 (03CR) 10Jgreen: [C: 032] flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 (owner: 10Jgreen)
[17:46:00] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Rename the labs instance setting labs_puppet_master. [puppet] - 10https://gerrit.wikimedia.org/r/307994 
[17:47:25] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 031] "I would be less tempted to override this, I think." [puppet] - 10https://gerrit.wikimedia.org/r/307994 (owner: 10Andrew Bogott)
[17:47:49] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Rename the labs instance setting labs_puppet_master. [puppet] - 10https://gerrit.wikimedia.org/r/307994 (owner: 10Andrew Bogott)
[17:47:57] <grrrit-wm>	 (03Abandoned) 10Jgreen: flip fundraising-read.wmnet db alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307992 (owner: 10Jgreen)
[17:49:47] <grrrit-wm>	 (03PS3) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[17:51:30] <icinga-wm>	 PROBLEM - Tool Labs instance distribution on labcontrol1001 is CRITICAL: CRITICAL: static class instances not spread out enough
[17:51:30] <icinga-wm>	 PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CRITICAL: static class instances not spread out enough
[17:51:50] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: onboarding Manuel Arostegui in ops - https://phabricator.wikimedia.org/T144469#2602203 (10RobH)
[17:51:52] <wikibugs>	 06Operations, 06WMF-NDA-Requests: Phabricator access for Manuel Arostegui to WMF-NDA and Ops-only procurement and other tasks - https://phabricator.wikimedia.org/T144496#2602200 (10RobH) 05Open>03Resolved a:03RobH RobH added a member for acl*operations-team: Marostegui. Thu, Sep 1, 17:50   I've updated t...
[17:52:23] <grrrit-wm>	 (03PS1) 10Jgreen: flip fundraising-read.wmnet alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307995 
[17:54:19] <icinga-wm>	 PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[17:54:19] <icinga-wm>	 RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough
[17:54:19] <icinga-wm>	 RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough
[17:54:49] <grrrit-wm>	 (03CR) 10Jgreen: [C: 032] flip fundraising-read.wmnet alias from db1008 to frdb1001 [dns] - 10https://gerrit.wikimedia.org/r/307995 (owner: 10Jgreen)
[17:55:35] <Jeff_Green>	 !log switching fundraising database reader from db1008 to frdb1001
[17:55:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:57:49] <grrrit-wm>	 (03PS4) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[17:58:38] <icinga-wm>	 PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:59:38] <icinga-wm>	 RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller
[17:59:44] <gehel>	 ^ elastic2020 checked, transient error, all is fine
[18:00:05] <jouncebot>	 anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1800). Please do the needful.
[18:00:05] <jouncebot>	 Niharika, dcausse, and jgirault: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:31] <Niharika>	 Hullo. 
[18:00:34] <dcausse>	 o/
[18:01:37] <icinga-wm>	 RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:01:55] <thcipriani>	 I can SWAT today
[18:02:01] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2602231 (10Cmjohnson) @fgiunchedi the new disk showed and I replaced the one that was producing errors...which was /dev/sdb afaik.  Please check and lmk how it goes.
[18:03:11] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) (owner: 10Niharika29)
[18:03:38] <grrrit-wm>	 (03Merged) 10jenkins-bot: Test PageAssessments on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307927 (https://phabricator.wikimedia.org/T142056) (owner: 10Niharika29)
[18:04:16] <thcipriani>	 Niharika: patch is live on mw1099 if you have anything to test there
[18:04:36] * Niharika checks
[18:04:45] <grrrit-wm>	 (03CR) 10Eevans: "[Puppet-compiler output here](http://puppet-compiler.wmflabs.org/3911/)" [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[18:04:59] <grrrit-wm>	 (03PS5) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[18:05:34] <Niharika>	 thcipriani: Looks fine to me. 
[18:05:43] <thcipriani>	 Niharika: ack, syncing everywhere
[18:08:09] <thcipriani>	 is  mw2187.codfw.wmnet having problems?
[18:08:49] <Niharika>	 Is that a question for me?
[18:09:19] <thcipriani>	 Niharika: no, for an opsen/root to check out
[18:09:25] <thcipriani>	 not letting me ssh to that machine from tin
[18:09:29] <Niharika>	 Ah, okay. 
[18:11:28] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 04m 54s)
[18:11:29] <stashbot>	 T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056
[18:11:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:12:55] <thcipriani>	 ^ Niharika I'm going to need to resync that once we get mw2187 working and/or removed from our server list, sorry :\
[18:13:10] <Niharika>	 thcipriani: No worries. Thank you. 
[18:14:23] <legoktm>	 thcipriani: uhh, is it half-deployed right now? I don't think it's a good idea to keep an extension half deployed...could cause weird stuff
[18:15:33] <thcipriani>	 legoktm: you're right. I'll revert locally until this is resolved
[18:19:10] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: REVERT because proxy down SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 03m 15s)
[18:19:10] <logmsgbot>	 !log gehel@palladium conftool action : set/pooled=yes; selector: name=elastic1029.eqiad.wmnet
[18:19:11] <stashbot>	 T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056
[18:19:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:19:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:19:32] <thcipriani>	 ^ should be reverted locally so we're not in a half deployed state
[18:19:55] <grrrit-wm>	 (03PS6) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[18:21:11] <Dereckson>	 thcipriani: yeah, but now deployed code != repo code
[18:21:57] <thcipriani>	 Dereckson: temporarily, until this is figured out
[18:23:07] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Revert "Pare down the cloud-init commands on precise" [puppet] - 10https://gerrit.wikimedia.org/r/307999 
[18:24:30] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Revert "Pare down the cloud-init commands on precise" [puppet] - 10https://gerrit.wikimedia.org/r/307999 (owner: 10Andrew Bogott)
[18:26:40] <mutante>	 !log mw2187 - powercycled
[18:26:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:29:22] <mutante>	 mw2187 is back up
[18:29:28] <mutante>	 and started its services
[18:30:23] <Dereckson>	 The last Puppet run was at Thu Sep  1 14:30:42 UTC 2016 (238 minutes ago).
[18:32:05] <Dereckson>	 When servers are rebooted, do they automatically retrieve the newest code through a scap pull, or should we do it manually to avoid inconsistency?
[18:32:26] <bd808>	 they need a scap pull
[18:32:34] <mutante>	 Dereckson: the latter please
[18:32:48] <mutante>	 i am running puppet
[18:32:54] <bd808>	 puppet only does the pull if the directory is completely missing
[18:33:32] <mutante>	 yea, could you please sync it
[18:33:35] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 02m 48s)
[18:33:36] <stashbot>	 T142056: Test PageAssessements on English Wikivoyage - https://phabricator.wikimedia.org/T142056
[18:33:40] <mutante>	 looks like it was down for 3h
[18:33:50] <mutante>	 of course it's ok if nothing changed since then
[18:34:09] <thcipriani>	 mutante: uhhh. There are like 32 other servers that may be down. I thought they were just having trouble connecting to that proxy/there was a bug with the scap code
[18:34:42] <mutante>	 the 32 down hosts in icinga are all ACKed
[18:34:47] <mutante>	 just one was not
[18:34:58] <greg-g>	 are they out of dsh?
[18:35:03] <thcipriani>	 no
[18:35:10] <greg-g>	 this again
[18:35:31] <mutante>	 they are all codfw and in scheduled downtimes
[18:35:37] <mutante>	 says icinga
[18:35:46] <greg-g>	 so they should be removed from dsh, yes?
[18:35:53] <greg-g>	 haven't we had this conversation before? :)
[18:35:55] <mutante>	 from Sep 1 to Sep 2
[18:35:59] <kaldari>	 thcipriani: Niharika's checking out (since it's past midnight her time), but I can take over for her for the PageAssessments testing.
[18:36:24] <wikibugs>	 06Operations, 10Traffic, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602326 (10BBlack) This ticket has dragged on and wavered off-topic considerably.  We've been supporting chapoly ciphers for about a month now, so the base issue here is resolve...
[18:36:30] <wikibugs>	 06Operations, 10Traffic, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602327 (10BBlack) 05Open>03Resolved a:03BBlack
[18:36:36] <mutante>	 yea, i dunno about that downtime though
[18:36:44] <mutante>	 doesnt say which user added it
[18:36:55] <thcipriani>	 kaldari: ack. Thank you. My understanding right now is that all the servers that don't have the new code are down for maintenance so it should be deployed everywhere.
[18:37:03] <greg-g>	 depooled from pybal should equal no getting in scap's way
[18:37:38] <greg-g>	 so, scap needs to talk to etcd/conftool, which is on our plan :)
[18:37:42] <greg-g>	 (I guess is the right way?)
[18:37:43] <thcipriani>	 fwiw _joe._ is working on a thing that does that/maybe already merged that thing and it's not working as expected
[18:37:52] * greg-g nods
[18:39:27] <kaldari>	 thcipriani: Cool, it does appear to be live on En WikiVoyage (which is our target). I'll test there.
[18:40:02] <mutante>	 thcipriani: do these 32 others block your deployment? or was it just that a scap-proxy was down that blocked you
[18:42:06] <thcipriani>	 mutante: not "blocking" per-se, but if they're in that dsh file then the deployment has to wait for ssh to timeout on those hosts, so a deploy that takes 45 seconds takes 3 minutes
[18:43:02] <thcipriani>	 which is a bummer and freaks me out and is pretty avoidable
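The slowdown thcipriani describes comes from waiting for ssh to time out on hosts that are down. A minimal sketch of one way to avoid that wait, assuming a quick TCP probe of the ssh port before syncing: this is not how scap works, and the names `reachable` and `partition_hosts` are hypothetical.

```python
import socket


def reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False


def partition_hosts(hosts, port=22, timeout=2.0):
    """Split hosts into (up, down) lists so a sync can skip the down ones."""
    up, down = [], []
    for host in hosts:
        (up if reachable(host, port, timeout) else down).append(host)
    return up, down
```

Calling `partition_hosts(host_list)` before a sync would bound the cost of each dead host to the probe timeout rather than the full ssh timeout.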
[18:43:05] <mutante>	 greg-g: eh.. there are no dsh files anymore actually
[18:43:36] <kaldari>	 thcipriani: Seems to be working well!
[18:43:38] <mutante>	 they are generated and things changed ...
[18:43:47] <mutante>	 looking
[18:44:00] <thcipriani>	 kaldari: awesome, thanks for checking :)
[18:44:03] * greg-g hangs head
[18:44:14] <thcipriani>	 ah, so that thing did merge likely
[18:44:37] <mutante>	 there was that "generate dsh files" ticket
[18:45:13] <mutante>	 https://phabricator.wikimedia.org/T80395  but i meant.. eh
[18:45:23] <mutante>	 this https://phabricator.wikimedia.org/rOPUPaf3047ecdad79125a334c798fba0606ab7fba330
[18:45:40] <mutante>	 goes to hiera
[18:46:00] <thcipriani>	 eh, I'm trying to find the more recent thing from _jo.e_ from like a week ago
[18:46:32] <mutante>	 i found the dsh.yaml file, but i do _not_ see those down hosts in there
[18:47:10] <mutante>	 the hosts that are down/acked in icinga are not in it 
[18:48:08] <thcipriani>	 https://phabricator.wikimedia.org/rOPUP331eb24b94ab2453451d8c1fc511ec6c7f51b2fe
[18:49:00] <mutante>	 conftool-data/nodes/codfw.yaml:  mw2167.codfw.wmnet: [apache2]
[18:49:03] <mutante>	 that
[18:49:12] <thcipriani>	 ah
[18:49:18] <mutante>	 let me remove them there
[18:49:52] <hasharAway>	 if one ever looks for the conftool-data, I have added the link to https://noc.wikimedia.org/ :D
[18:50:33] <thcipriani>	 cool, thanks
[18:50:42] <apergos>	 yeah I added the noc main page to my handy bookmarks, it's lookin pretty good
[18:51:02] <hashar>	 we also had a bunch of servers reimaged over the day
[18:52:26] <thcipriani>	 hrm, looks like they're all enabled: False there https://config-master.wikimedia.org/conftool/codfw/apaches
[18:53:16] <thcipriani>	 maybe tin just needs a puppet run to update the dsh files? Or maybe scap_proxies is not being generated anymore and it just unmanaged now.
[18:53:35] <grrrit-wm>	 (03PS1) 10Dzahn: remove mw2167 thru mw2199 from conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/308004 
[18:56:07] <grrrit-wm>	 (03CR) 10Gergő Tisza: Do not use $wgExtensionFunctions to set globals (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza)
[18:57:10] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "they are also all disabled here https://config-master.wikimedia.org/conftool/codfw/apaches to be reverted after maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/308004 (owner: 10Dzahn)
[18:58:04] <wikibugs>	 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2602444 (10Andrew) 05Open>03Resolved This is done now.  Passwords are per-project, and located in var/local/labs-root-passwords/ on the labs puppetmaster (currently labcontrol1001).
[18:58:46] <thcipriani>	 wonder if there's something wrong here? https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/templates/dsh/dsh-group.erb#L15
[18:59:14] <thcipriani>	 or possibly where they're defined in https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/scap/dsh.yaml#L19
[18:59:18] <mutante>	 thcipriani: ehmm.. that isnt merged yet, and now i'm not sure 
[18:59:22] <mutante>	 reading wikitech
[18:59:29] <mutante>	 that step is only for actual "decom"
[18:59:49] <mutante>	 but the scheduled downtime is just for 24h
[19:00:04] <jouncebot>	 hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T1900).
[19:00:14] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] "I tested this in deployment-prep, and it seems to work as advertised." [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval)
[19:00:20] <thcipriani>	 what isn't merged yet?
[19:00:35] <mutante>	 to remove the server names from conftool data files in puppet repo
[19:00:51] <mutante>	 (as opposed to just running commands to depool them)
[19:02:16] <hashar>	 o/
[19:02:39] <hashar>	 jouncebot: delay
[19:02:44] <thcipriani>	 mutante: ah yeah I think there's some issue with the dynamic writing of the dsh files, somewhere in one of the files I linked, I'm not sure about removing them either, but this is causing problems :\
[19:03:44] <mutante>	 they are not in dsh.yaml
[19:03:47] <hashar>	 is puppet querying conftool to generate the dsh files ?
[19:03:54] <mutante>	 that's where i was before
[19:04:44] <mutante>	 let's try removing them from conftool-data i guess (even though the docs say that only for permanent decom)
[19:04:57] <thcipriani>	 hashar: yarp https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/templates/dsh/dsh-group.erb#L13
[19:05:00] <mutante>	 unless you can survive like this for 24h
[19:05:09] <mutante>	 then the scheduled downtime would be over
[19:05:21] <hashar>	 I have to do the train deploy to group2 now
[19:06:04] <mutante>	 ok, i submitted https://gerrit.wikimedia.org/r/#/c/308004/
[19:06:21] <thcipriani>	 dcausse: jgirault_ Going to have to push these patches to evening SWAT, sorry for the technical difficulties :(
[19:06:34] <hashar>	 or kill puppet on tin and manually update the dsh files
[19:06:43] <dcausse>	 thcipriani: no problem, thanks!
[19:07:15] <mutante>	 i ran puppet-merge 
[19:07:29] <mutante>	 there was an error running conftool which is started by that :/
[19:08:48] <mutante>	 runs puppet on tin 
[19:08:56] <hashar>	 mutante: based on giuseppe commit 331eb24b94ab2453451d8c1fc511ec6c7f51b2fe  it should skip servers marked 'inactive'
[19:09:08] <mutante>	 codfw: Removing node mw2187.codfw.wmnet from cluster appserver/apache2
[19:09:11] <mutante>	 ERROR:conftool:delete_node Backend error while deleting node: Backend error: The request requires user authentication : Insufficient credentials
[19:09:14] <mutante>	 that's an env thing
[19:09:28] <mutante>	 puppet-merge was supposed to trigger conftool actions
[19:09:32] <jgirault_>	 thcipriani ok, thanks!
[19:09:37] <mutante>	 for "delete_node"
[19:11:16] <mutante>	 hashar: thcipriani: i edited the dsh file in tin directly
[19:11:29] <hashar>	 neat
[19:11:35] <hashar>	 { 'host': 'mw2183.codfw.wmnet', 'weight':20, 'enabled': False }
[19:11:43] <mutante>	 !log tin removing mw2167 thru mw2199 from dsh file manually, re-running puppet
[19:11:44] <hashar>	 hosts.select{ |x| x['value']['pooled'] != 'inactive' }
[19:11:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:11:57] <hashar>	 so yeah seems we flag them 'enabled': false
[19:12:09] <hashar>	 but the dsh file is based on 'pooled' == 'inactive'  ? :(
[19:12:14] <thcipriani>	 hashar: yeah but this is the output from confctl which may be different?
[19:12:22] <hashar>	 yeah
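The mismatch hashar points out can be sketched in Python. This is an illustration only: the records below are hypothetical, and the mapping from `pooled` to `enabled` is an assumption that the log does not confirm. It shows how a depooled host (`pooled` set to `no`, so displayed as `enabled: False`) would still pass a filter that only drops `inactive` hosts.

```python
# Hypothetical records; field names mirror the two snippets quoted above,
# the values are made up for illustration.
hosts = [
    {"key": "mw2183.codfw.wmnet", "value": {"pooled": "no", "weight": 20}},
    {"key": "mw2184.codfw.wmnet", "value": {"pooled": "inactive", "weight": 20}},
    {"key": "mw1099.eqiad.wmnet", "value": {"pooled": "yes", "weight": 20}},
]


def dsh_hosts(hosts):
    """Python equivalent of hosts.select{ |x| x['value']['pooled'] != 'inactive' }."""
    return [h["key"] for h in hosts if h["value"]["pooled"] != "inactive"]


def enabled_view(hosts):
    """Assumed mapping: 'enabled' is True only when pooled == 'yes'."""
    return {h["key"]: h["value"]["pooled"] == "yes" for h in hosts}


print(dsh_hosts(hosts))     # hosts that would be written to the dsh group
print(enabled_view(hosts))  # what an 'enabled' listing would show
```

Under these assumptions, a host that is merely depooled stays in the generated dsh group, and only hosts explicitly marked `inactive` are left out.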
[19:12:50] <thcipriani>	 also: none of those dsh files look like they're coming from this function, they don't have comments saying they're managed by puppet or anything
[19:13:06] <mutante>	 running puppet re-adds the hosts, we have to disable puppet on tin
[19:13:20] <mutante>	 until the deploy is over
[19:13:21] <thcipriani>	 oh no, they do, just the proxies and stuff :)
[19:14:11] <mutante>	 !log tin temp. disabled puppet 
[19:14:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:34] <mutante>	 thcipriani: ok, so, i have to go but i want to help.. i removed them again from /etc/dsh/group/mediawiki-installation directly, disabled puppet
[19:14:37] <mutante>	 hashar: ^
[19:14:44] <mutante>	 once i'm back i would re-enable it
[19:14:57] <mutante>	 just have to pick somebody up right now
[19:15:05] <hashar>	 thcipriani: anything left to do for SWAT ?
[19:15:06] <thcipriani>	 mutante: ok, sounds good, thank you for your help, this should tide us over
[19:15:12] <hashar>	 mutante: thank you !!!!
[19:15:14] <thcipriani>	 hashar: no, I bumped everything
[19:15:18] <mutante>	 be back soon
[19:15:24] <hashar>	 ok rolling the train so
[19:15:35] <hashar>	 which a five-year-old can do https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Thursday:_group.7B0.2C1.7D_to_all_deploy
[19:19:31] <logmsgbot>	 !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.17
[19:19:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:20:14] <grrrit-wm>	 (03PS1) 10Ottomata: Reportupdater should rsync to ::srv module instead of :www [puppet] - 10https://gerrit.wikimedia.org/r/308008 (https://phabricator.wikimedia.org/T144278) 
[19:20:42] <hashar>	 SELECT /* CategoryMembershipChangeJob::run 127.0.0.1 */ GET_LOCK('CategoryMembershipUpdates:XXXXX', 10) AS lockstatus
[19:20:43] <hashar>	 really
[19:20:53] <hashar>	 I am wondering why we use the database as a lock manager :/
[19:22:47] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Reportupdater should rsync to ::srv module instead of :www [puppet] - 10https://gerrit.wikimedia.org/r/308008 (https://phabricator.wikimedia.org/T144278) (owner: 10Ottomata)
[19:24:57] <wikibugs>	 06Operations, 10Traffic: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2602609 (10BBlack)
[19:25:29] <wikibugs>	 06Operations, 10DBA, 06Performance-Team, 10Traffic, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2602624 (10aaron)
[19:25:32] <wikibugs>	 06Operations, 06Performance-Team, 07Availability, 05MW-1.28-release-notes, and 2 others: Audit mysql database class and hhvm binding support of SSL - https://phabricator.wikimedia.org/T136218#2602623 (10aaron) 05Open>03Resolved
[19:36:34] <grrrit-wm>	 (03PS1) 10BBlack: Revert "package::builder: EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 
[19:36:44] <grrrit-wm>	 (03PS2) 10BBlack: Revert "package::builder: EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 
[19:36:57] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "package::builder: EXPERIMENTAL=yes support" [puppet] - 10https://gerrit.wikimedia.org/r/308010 (owner: 10BBlack)
[19:39:09] <urandom>	 !log T143226: Clearing repair status restbase1011-c.eqiad.wmnet
[19:39:10] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[19:39:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:39:45] <ottomata>	 hashar:  global distributed lock ? :)
[19:44:03] <urandom>	 !log T143226: Clearing repair status: eqiad, rack 'b' nodes
[19:44:04] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[19:44:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:45:10] <ottomata>	 hashar:  lemme know how this deploy goes, i'm watching stuff
[19:54:04] <gehel>	 !log reloading ferm rules on elasticsearch eqiad cluster
[19:54:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:58:20] <hashar>	 !log 1.28.0-wmf.17 rolled on group2 and apparently all fine
[19:58:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:58:32] <hashar>	 ottomata: I have pushed it roughly 40 minutes ago and it looks all fine
[20:01:06] <ottomata>	 nice hashar cool
[20:01:12] <ottomata>	 ya looks find from here so far too
[20:01:16] <ottomata>	 Pchelolo:  ^
[20:01:18] <ottomata>	 fine*
[20:01:57] <hashar>	 ottomata: I am still around a bit watching it
[20:02:02] <hashar>	 else poke #wikimedia-releng :]
[20:02:22] <ottomata>	 ok great thanks
[20:08:47] <grrrit-wm>	 (03PS1) 10ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 
[20:08:49] <grrrit-wm>	 (03PS1) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 
[20:09:02] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 (owner: 10ArielGlenn)
[20:09:09] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 (owner: 10ArielGlenn)
[20:09:13] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2602875 (10Smalyshev)
[20:10:44] <apergos>	 of course. I thought I pep8'd that thing
[20:11:12] <apergos>	 oh both are whining.  meeeehh
[20:12:04] <grrrit-wm>	 (03PS1) 10Cmjohnson: Adding mkroetzsch to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 
[20:13:02] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Make puppet generate path config for WDQS nodes - https://phabricator.wikimedia.org/T144537#2602893 (10Smalyshev)
[20:13:13] <grrrit-wm>	 (03PS2) 10ArielGlenn: add timeout and related callback to method for running proc without output [dumps] - 10https://gerrit.wikimedia.org/r/308015 
[20:13:43] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Adding mkroetzsch to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 (owner: 10Cmjohnson)
[20:13:51] <grrrit-wm>	 (03PS2) 10Cmjohnson: Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 
[20:14:18] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2602914 (10Smalyshev) a:03Gehel
[20:14:41] <grrrit-wm>	 (03PS2) 10ArielGlenn: fix up locking for misc dumps [dumps] - 10https://gerrit.wikimedia.org/r/308016 
[20:15:10] <grrrit-wm>	 (03PS3) 10Cmjohnson: Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 
[20:15:50] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602931 (10Smalyshev)
[20:16:13] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602931 (10Smalyshev) Low priority - we can live with the link for now, it's just not clean :)
[20:17:11] <grrrit-wm>	 (03PS1) 10Legoktm: Add basic debug logging functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308019 
[20:17:13] <grrrit-wm>	 (03PS1) 10Legoktm: Add x-default-query functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 
[20:17:40] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Adding mkroetzsch and akrausetud to analytics private data users group [puppet] - 10https://gerrit.wikimedia.org/r/308017 (owner: 10Cmjohnson)
[20:17:59] <wikibugs>	 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539#2602963 (10Smalyshev)
[20:19:51] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603004 (10Cmjohnson) Patchset has been merged for both users. You should have access at this time. It may take...
[20:34:31] <grrrit-wm>	 (03PS1) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) 
[20:34:58] <wikibugs>	 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2603094 (10Neil_P._Quinn_WMF)
[20:35:34] <urandom>	 !log T143226: Clearing repair status: eqiad, rack 'dd' nodes
[20:35:35] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[20:35:38] <urandom>	 !log T143226: Clearing repair status: eqiad, rack 'd' nodes
[20:35:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:35:39] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[20:35:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:39:50] <grrrit-wm>	 (03CR) 10Anomie: Do not use $wgExtensionFunctions to set globals (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: 10Gergő Tisza)
[20:40:16] <grrrit-wm>	 (03PS2) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) 
[20:42:37] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603127 (10Cmjohnson) 05Open>03Resolved
[20:47:42] <grrrit-wm>	 (PS1) Gehel: wdqs - move data to /srv [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536)
[20:47:48] <grrrit-wm>	 (PS3) Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055)
[20:48:58] <grrrit-wm>	 (PS4) Gergő Tisza: Do not use $wgExtensionFunctions to set globals [mediawiki-config] - https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055)
[20:49:58] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] wdqs - move data to /srv [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: Gehel)
[20:50:18] <grrrit-wm>	 (CR) Gergő Tisza: Do not use $wgExtensionFunctions to set globals (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/307893 (https://phabricator.wikimedia.org/T143055) (owner: Gergő Tisza)
[20:54:23] <grrrit-wm>	 (PS2) Gehel: wdqs - move data to /srv [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536)
[20:54:38] <wikibugs>	 Operations, Labs, Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2603156 (yuvipanda) wikidata-query done.
[20:55:51] <grrrit-wm>	 (CR) Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3914/" [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: Gehel)
[20:58:41] <grrrit-wm>	 (CR) Smalyshev: "modules/wdqs/templates/initscripts/wdqs-updater.systemd.erb and modules/wdqs/templates/initscripts/wdqs-updater.upstart.erb need to be upd" [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: Gehel)
[20:58:55] <mutante>	 !log mw2187 - shut down
[20:59:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:59:01] <grrrit-wm>	 (CR) Smalyshev: [C: -1] wdqs - move data to /srv [puppet] - https://gerrit.wikimedia.org/r/308023 (https://phabricator.wikimedia.org/T144536) (owner: Gehel)
[21:01:19] <grrrit-wm>	 (PS1) Dzahn: Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - https://gerrit.wikimedia.org/r/308024
[21:03:26] <grrrit-wm>	 (PS3) Rush: labstore: nfs-exports-daemon on all servers in a cluster [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083)
[21:04:06] <grrrit-wm>	 (PS4) Rush: labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083)
[21:06:50] <grrrit-wm>	 (CR) Madhuvishy: [C: 1] labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) (owner: Rush)
[21:11:06] <grrrit-wm>	 (PS2) Dzahn: Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - https://gerrit.wikimedia.org/r/308024
[21:11:11] <grrrit-wm>	 (CR) Dzahn: [C: 2] Revert "remove mw2167 thru mw2199 from conftool-data" [puppet] - https://gerrit.wikimedia.org/r/308024 (owner: Dzahn)
[21:16:43] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5107082 keys - replication_delay is 714
[21:18:37] <grrrit-wm>	 (PS10) 20after4: WIP: Scap swat command [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880)
[21:21:26] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1007-a
[21:21:27] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:21:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:21:58] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1010-a
[21:21:59] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:22:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:22:20] <grrrit-wm>	 (PS11) 20after4: WIP: Scap swat command [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880)
[21:22:23] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1011-a
[21:22:24] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:22:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:22:36] <grrrit-wm>	 (CR) MaxSem: WIP: Scap swat command (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 20after4)
[21:24:55] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1008-a
[21:24:56] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:24:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:25:26] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1012-a
[21:25:27] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:25:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:25:39] <grrrit-wm>	 (PS12) 20after4: WIP: Scap swat command [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880)
[21:25:40] <urandom>	 !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1013-a
[21:25:41] <stashbot>	 T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226
[21:25:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:26:05] <grrrit-wm>	 (CR) 20after4: WIP: Scap swat command (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 20after4)
[21:28:54] <grrrit-wm>	 (CR) BBlack: [C: 2] openssl (1.1.0-1+wmf2) experimental; urgency=medium [debs/openssl] (wmf-1.1) - https://gerrit.wikimedia.org/r/307983 (owner: BBlack)
[21:39:27] <grrrit-wm>	 (PS2) Hashar: Use task to run modules spec [puppet] - https://gerrit.wikimedia.org/r/307223
[21:39:54] <grrrit-wm>	 (CR) Mobrovac: "PS2 PCC looking good - https://puppet-compiler.wmflabs.org/3913/" [puppet] - https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: Mobrovac)
[21:41:02] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5062622 keys - replication_delay is 0
[21:41:13] <grrrit-wm>	 (CR) Hashar: "Bah 'rake' has somehow disappeared :(" [puppet] - https://gerrit.wikimedia.org/r/307223 (owner: Hashar)
[21:41:44] <icinga-wm>	 ACKNOWLEDGEMENT - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T144260
[21:42:15] <wikibugs>	 Operations, ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2603297 (Dzahn) Resolved>Open re-opening because the host is marked as DOWN in Icinga
[22:00:04] <jouncebot>	 MaxSem: Respected human, time to deploy Kartographer deployment to more wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2200). Please do the needful.
[22:00:10] <grrrit-wm>	 (PS5) Rush: labstore: run nfs-exports-daemon on all servers in a cluster [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083)
[22:00:10] <MaxSem>	 jouncebot, next
[22:00:10] <jouncebot>	 In 0 hour(s) and 59 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300)
[22:00:20] <wikibugs>	 Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2603387 (leila) thanks a lot, @Cmjohnson.
[22:00:25] <MaxSem>	 lies.
[22:00:34] <MaxSem>	 my window
[22:00:47] <grrrit-wm>	 (PS1) Ppchelko: Change-Prop: Switch to new events. [puppet] - https://gerrit.wikimedia.org/r/308077
[22:00:59] <grrrit-wm>	 (PS2) MaxSem: Deploy Kartographer everywhere public [mediawiki-config] - https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062)
[22:01:16] <greg-g>	 MaxSem: "next" at 3:00 is SWAT, it doesn't have "now"
[22:01:18] <greg-g>	 jouncebot: now
[22:01:35] <MaxSem>	 :P
[22:01:44] <grrrit-wm>	 (CR) MaxSem: [C: 2] Deploy Kartographer everywhere public [mediawiki-config] - https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[22:02:11] <grrrit-wm>	 (Merged) jenkins-bot: Deploy Kartographer everywhere public [mediawiki-config] - https://gerrit.wikimedia.org/r/307981 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[22:03:13] <MaxSem>	 jgirault, ready? :)
[22:03:40] <MaxSem>	 pulled on mw1099
[22:03:46] <jgirault>	 !!
[22:05:25] <MaxSem>	 ooops
[22:05:38] <MaxSem>	 this enables mapframe too
[22:06:07] <jgirault>	 yeah :/
[22:06:21] * MaxSem scratches head
[22:08:41] <MaxSem>	 > echo json_encode($wgKartographerEnableTags);
[22:08:41] <MaxSem>	 ["mapframe","maplink","maplink"]
[22:09:55] <MaxSem>	 legoktm, is there a way to override the value of an array from extension.json, as opposed to merging?
[22:11:30] <wikibugs>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#2497714 (demon) I think a month is plenty of grace period, let's kill this thing.
[22:13:34] <grrrit-wm>	 (PS2) Ppchelko: Change-Prop: Switch to new events. [puppet] - https://gerrit.wikimedia.org/r/308077
[22:25:09] <legoktm>	 MaxSem: foo => true, bar => false
[22:25:28] <MaxSem>	 legoktm, https://gerrit.wikimedia.org/r/308082
[22:25:53] <MaxSem>	 I'm pretty sure I tested it though
[22:25:59] <MaxSem>	 :confused:
[22:26:14] <legoktm>	 that also works
[22:26:34] <MaxSem>	 can you +2 it then? :p
[22:29:15] <grrrit-wm>	 (PS1) MaxSem: $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - https://gerrit.wikimedia.org/r/308084
[22:29:51] <jgirault>	 MaxSem I can +2 it
[22:29:53] <jgirault>	 looks good
[22:30:09] <MaxSem>	 already reviewed
[22:35:31] <MaxSem>	 dear holy Zuul...
[22:36:06] <grrrit-wm>	 (PS1) BryanDavis: Add a "now" command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/308086
[22:36:08] <grrrit-wm>	 (CR) MaxSem: [C: 2] $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - https://gerrit.wikimedia.org/r/308084 (owner: MaxSem)
[22:36:25] <grrrit-wm>	 (PS1) BryanDavis: Use normal messages rather than notices for help [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/308087
[22:36:35] <grrrit-wm>	 (Merged) jenkins-bot: $wgKartographerEnableTags --> $wgKartographerEnableMapFrame [mediawiki-config] - https://gerrit.wikimedia.org/r/308084 (owner: MaxSem)
[22:36:47] <grrrit-wm>	 (CR) Dzahn: "yes, desired feature indeed :)" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/308086 (owner: BryanDavis)
[22:53:26] <grrrit-wm>	 (PS1) MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062)
[22:53:30] <grrrit-wm>	 (CR) MaxSem: [C: 2] Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[22:55:21] <icinga-wm>	 PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: puppet fail
[22:56:07] <grrrit-wm>	 (PS2) MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062)
[22:56:20] <icinga-wm>	 PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[22:56:27] <grrrit-wm>	 (CR) MaxSem: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[22:56:31] <grrrit-wm>	 (CR) MaxSem: [C: 2] Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[22:57:00] <grrrit-wm>	 (Merged) jenkins-bot: Revert "Deploy Kartographer everywhere public" [mediawiki-config] - https://gerrit.wikimedia.org/r/308088 (https://phabricator.wikimedia.org/T144062) (owner: MaxSem)
[23:00:04] <jouncebot>	 RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300).
[23:00:04] <jouncebot>	 ebernhardson, Dereckson, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:18] <MaxSem>	 1 sec, I'm wrapping up
[23:00:52] <ebernhardson>	 \o
[23:00:59] <jdlrobson>	 where did i go?
[23:01:19] <RoanKattouw>	 jdlrobson: You're not listed at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160901T2300
[23:01:21] <logmsgbot>	 !log maxsem@tin Synchronized php-1.28.0-wmf.17/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/308085/ (duration: 02m 54s)
[23:01:22] <jdlrobson>	 feck i put it in the wrong box
[23:01:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:01:28] <jdlrobson>	 Rrrggg i hate editing this thing
[23:01:32] <RoanKattouw>	 Crap I need to add more changes
[23:01:51] <wikibugs>	 Operations, Wikimedia-Mailing-lists: reset password of winedale-l - https://phabricator.wikimedia.org/T144416#2603647 (Dzahn) Open>Resolved a:KFrancis
[23:01:53] <paladox>	 irish
[23:01:58] <jdlrobson>	 moved it
[23:02:19] <jdlrobson>	 RoanKattouw: >https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=819253&oldid=819235
[23:03:37] <mutante>	 @seen Volker_E 
[23:03:37] <wm-bot>	 mutante: I have never seen Volker_E 
[23:03:42] <RoanKattouw>	 Got it
[23:03:48] <mutante>	 wm-bot: he's right here
[23:03:55] <mutante>	 Volker_E: hi?
[23:03:58] <RoanKattouw>	 I can't do the SWAT today BTW because OIT is trying to rescue my laptop from the water I spilled on it
[23:04:07] <RoanKattouw>	 And my SSH keys are on the hard drive that they're trying to recover
[23:04:14] <paladox>	 LOL
[23:04:14] <RoanKattouw>	 I'm using a loaner right now
[23:05:38] <paladox>	 jdlrobson is that irish? :16
[23:06:02] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 02m 53s)
[23:06:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:17] <jdlrobson>	 irish? you've lost me.
[23:06:36] <paladox>	 jdlrobson <jdlrobson> feck i put it in the wrong box
[23:06:37] <MaxSem>	 !log That for https://gerrit.wikimedia.org/r/#/c/308084/
[23:06:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:45] <MaxSem>	 I'm done, btw
[23:06:49] <paladox>	 I hear that only from Irish people, like Mrs Brown's Boys
[23:07:42] <Dereckson>	 MaxSem: RoanKattouw: one of you SWAT?
[23:07:48] <Dereckson>	 23:03:58 < RoanKattouw> I can't do the SWAT today 
[23:07:53] <Dereckson>	 I've the answer
[23:08:06] <RoanKattouw>	 Dereckson: Sorry, I spilled water on my laptop so I don't have my SSH keys today
[23:08:08] <Dereckson>	 Okay, so I can SWAT.
[23:08:11] <RoanKattouw>	 Thanks
[23:10:01] <Dereckson>	 jgirault: ping
[23:10:35] <Dereckson>	 jgirault: if you're there, do you want to reschedule now https://gerrit.wikimedia.org/r/#/c/307965/ ?
[23:12:42] <Dereckson>	 ebernhardson: https://gerrit.wikimedia.org/r/#/c/306552/ is independent of the two other changes, isn't it?
[23:12:53] <ebernhardson>	 Dereckson: right
[23:14:00] <jdlrobson>	 Dereckson: thank you. Note one of mine is a labs change.
[23:14:42] <grrrit-wm>	 (PS2) Dereckson: Disable phrase suggester for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: EBernhardson)
[23:16:51] <Dereckson>	 jdlrobson: please check tests at https://gerrit.wikimedia.org/r/#/c/307971 and https://gerrit.wikimedia.org/r/#/c/307970
[23:17:00] <Dereckson>	 Currently V -1 by Jenkins
[23:17:11] <Dereckson>	 (I've done a recheck on both)
[23:17:15] <jdlrobson>	 Dereckson: those are session manager issues Fatal error: Uncaught exception 'DBQueryError' with message ' in /mnt/jenkins-workspace/workspace/mwext-mw-selenium/src/includes/db/Database.php on line 
[23:17:17] <ebernhardson>	 :S
[23:17:18] <jdlrobson>	 so should be unrelated
[23:18:00] <jdlrobson>	 Dereckson: shoot. https://gerrit.wikimedia.org/r/#/c/304306/ is probably needed
[23:19:04] <jdlrobson>	 zero patch will be okay with https://gerrit.wikimedia.org/r/308094
[23:20:14] <Dereckson>	 jdlrobson: "8 patches max" is a limit by window, not by requester
[23:20:30] <ebernhardson>	 lol :)
[23:21:01] <jdlrobson>	 sure. I'm not sure what to do in this situation though. The Zero and lazy images bugs are pretty critical
[23:21:09] <jdlrobson>	 images are not loading for some people and Zero users cannot edit
[23:21:17] <grrrit-wm>	 (CR) Dereckson: [C: 2] Disable phrase suggester for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: EBernhardson)
[23:21:26] <jdlrobson>	  restore taglines on labs i can do outside window so you can skip that one
[23:21:38] <jdlrobson>	 and "End lazy loading reference experiments" is a nice to have
[23:21:44] <grrrit-wm>	 (Merged) jenkins-bot: Disable phrase suggester for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) (owner: EBernhardson)
[23:21:57] <jdlrobson>	 it will just be annoying to wait another week
[23:22:24] <Dereckson>	 jdlrobson: add them to the calendar
[23:22:30] <jdlrobson>	 Dereckson: they are all there
[23:22:48] <jdlrobson>	 but https://gerrit.wikimedia.org/r/#/c/304306/ may be needed to get around that jenkins issue (although it should be fine to force merge)
[23:23:07] <Dereckson>	 RoanKattouw: an opinion about that ? ^
[23:23:32] <icinga-wm>	 RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:23:32] <Dereckson>	 jdlrobson: by the way 307967 is independent of all the remaining patches?
[23:23:40] <grrrit-wm>	 (PS2) Dereckson: Enable Wikidata descriptions taglines on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: Jdlrobson)
[23:23:43] <jdlrobson>	 they are all independent
[23:23:45] <RoanKattouw>	 If there's a known CI issue that's fixed in master, force-merging in the wmf branch is OK IMO
[23:26:44] <grrrit-wm>	 (PS1) Paladox: Revert change that fixed the diff being cutoff [puppet] - https://gerrit.wikimedia.org/r/308095
[23:27:07] <grrrit-wm>	 (PS2) Paladox: Revert change that fixed the diff being cutoff [puppet] - https://gerrit.wikimedia.org/r/308095
[23:27:21] <Dereckson>	 jdlrobson: for 307967 labs, I wonder if the ideal is only en (like you did) or for all the nowikidatadescriptiontaglines dblist?
[23:27:36] <grrrit-wm>	 (CR) Catrope: [C: 1] "This change caused bugs with contextual commenting" [puppet] - https://gerrit.wikimedia.org/r/308095 (owner: Paladox)
[23:27:46] <jdlrobson>	 Dereckson: For our purposes it should be enough - it's the only one we test on.
[23:27:50] * Dereckson nods.
[23:28:09] <grrrit-wm>	 (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: Jdlrobson)
[23:28:35] <grrrit-wm>	 (Merged) jenkins-bot: Enable Wikidata descriptions taglines on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/307967 (https://phabricator.wikimedia.org/T143345) (owner: Jdlrobson)
[23:30:48] <Dereckson>	 mutante: could you do again the kludge you did for the train?
[23:31:14] <Dereckson>	 mutante: tin is still blocked by down proxy
[23:31:59] <Dereckson>	 We've 10 patches to sync with a 3-minute timeout to wait for scap. Yeah.
[23:32:30] <logmsgbot>	 !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Enable Wikidata descriptions taglines on labs (T143345, no-op in prod) (duration: 02m 52s)
[23:32:31] <stashbot>	 T143345: Deploy Wikidata descriptions to mobile web stable channel Wikipedias 2nd half - https://phabricator.wikimedia.org/T143345
[23:32:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:40] <bd808>	 just take the down proxy out of the list of proxies?
[23:35:48] <bd808>	 it's the scap-proxies dsh group that it needs to come out of
[23:36:10] <RoanKattouw>	 bd808: You "just" need root powers for that
[23:36:10] <mutante>	 Dereckson: yes, i can
[23:36:22] <Dereckson>	 thanks
[23:36:22] <mutante>	 well, the dsh part
[23:36:52] <bd808>	 RoanKattouw: *nod* mostly for mutante's benefit rather than starting the host up again
[23:37:05] <mutante>	 well, i was told to shut it down again
[23:37:26] <mutante>	 if you mean the part where i powered up the server
[23:37:26] <bd808>	 yeah, which seems fine. we just need to make scap stop talking to it
[23:38:19] <mutante>	 !log tin stopping puppet
[23:38:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:39:18] <mutante>	 !log tin removed mw2187 from /etc/dsh/group/scap-proxies
[23:39:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:01] <Dereckson>	 jdlrobson: okay I've manually merged them
[23:40:17] <Dereckson>	 thanks mutante 
[23:40:46] <mutante>	 i was looking for the other appservers too
[23:40:52] <mutante>	 that we removed earlier
[23:40:57] <mutante>	 but looks like they are gone now
[23:41:04] <mutante>	 try it
[23:41:23] <Dereckson>	 ebernhardson: why do you want 307954 on wmf/1.28.0-wmf.16 by the way?
[23:41:36] <ebernhardson>	 Dereckson: because i tried to ship it this morning and this morning was canceled :P
[23:41:42] <Dereckson>	 we're 1.28.0-wmf.17 only
[23:41:44] <Dereckson>	 ok
[23:41:45] <ebernhardson>	 Dereckson: can revert and not bother with .16
[23:42:10] <Dereckson>	 ebernhardson: yes, please do a revert change for .16
[23:45:10] <Dereckson>	 ebernhardson: 307955 Do not use the suggest reverse field if it's a non local search live on mw1099
[23:46:05] <Dereckson>	 RoanKattouw: your patches are live on mw1099
[23:46:09] <ebernhardson>	 Dereckson: ok sec, seeing if it fixes the broken query
[23:46:48] <RoanKattouw>	 Dereckson: Thanks. I just realized I'll need to install the plugin to connect to mw1099 :/
[23:46:58] <RoanKattouw>	 (Because I'm on a temporary laptop)
[23:47:09] <Dereckson>	 it's a no-reboot plugin, that will be quick
[23:47:12] <RoanKattouw>	 Yeah
[23:47:39] <RoanKattouw>	 OK that took like no time at all
[23:47:49] <RoanKattouw>	 26 seconds to Google it, find it, and install it
[23:48:04] <Dereckson>	 jdlrobson: MF and ZeroBanner patches live on mw1099
[23:48:09] <ebernhardson>	 Dereckson: works
[23:48:10] <jdlrobson>	 Dereckson: on it
[23:49:02] <Dereckson>	 ebernhardson: ack'ed
[23:50:01] <RoanKattouw>	 Dereckson: OK, my patches are working. There's one aspect I can't test because it's specific to IE, and there's no X-WM-Debug plugin for IE I think
[23:50:19] <RoanKattouw>	 But I'll test that once it's fully deployed
[23:51:03] <Dereckson>	 ok
[23:51:09] <logmsgbot>	 !log dereckson@tin Synchronized php-1.28.0-wmf.17/extensions/CirrusSearch/includes/Query/FullTextQueryStringQueryBuilder.php: Do not use the suggest reverse field if it's a non local search ([[Gerrit:307955]]) (duration: 00m 48s)
[23:51:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:51:52] <Dereckson>	 RoanKattouw: Duplicate get(): "enwiki:echo:seen:alert:time:15999850" fetched 3 times
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2167 is CRITICAL: Host mw2167 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2168 is CRITICAL: Host mw2168 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2169 is CRITICAL: Host mw2169 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2170 is CRITICAL: Host mw2170 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2172 is CRITICAL: Host mw2172 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2173 is CRITICAL: Host mw2173 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:54] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2174 is CRITICAL: Host mw2174 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:51:55] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2175 is CRITICAL: Host mw2175 is not in mediawiki-installation dsh group daniel_zahn in scheduled downtime
[23:52:21] <RoanKattouw>	 Dereckson: That should be unrelated, none of my patches are in Echo
[23:52:23] <Dereckson>	 (I see that also when someone tests a non Echo patch on mw1099)
[23:52:32] <RoanKattouw>	 But we should fix that some time
[23:52:36] <RoanKattouw>	 I think there's a task about it somewhere
[23:53:18] <jdlrobson>	 lazy image one is fixed Dereckson 
[23:53:26] <jdlrobson>	 looking into the zero one
[23:53:49] <RoanKattouw>	 Dereckson: https://phabricator.wikimedia.org/T144534
[23:53:50] <jdlrobson>	 Dereckson: turns out zero is near impossible to test
[23:57:33] <logmsgbot>	 !log dereckson@tin Synchronized php-1.28.0-wmf.17/extensions/VisualEditor/lib/ve: Update lib/ve submodule for Ib9bbaccfff9 (duration: 00m 47s)
[23:57:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master